
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Published 26 Sep 2025 in cs.CV and cs.CL | arXiv:2509.22186v2

Abstract: We introduce MinerU2.5, a 1.2B-parameter document-parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

Summary

  • The paper demonstrates a novel decoupled two-stage parsing strategy that mitigates token redundancy and improves efficiency in high-resolution document parsing.
  • It integrates a NaViT vision encoder, Qwen2-Instruct language model, and pixel-unshuffle patch merger for precise layout analysis and fine-grained content recognition.
  • Experimental evaluations show state-of-the-art performance on benchmarks, with notable improvements in speed and accuracy for tables, formulas, and overall layout detection.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Introduction and Motivation

MinerU2.5 introduces a 1.2B-parameter vision-language model (VLM) specifically designed for high-resolution document parsing. The model addresses the computational and semantic challenges inherent in document images, which typically exhibit high resolution, dense content, and complex layouts. Existing approaches, whether modular pipelines or end-to-end VLMs, suffer from error propagation, hallucination, and severe inefficiency due to token redundancy. MinerU2.5's core innovation is a decoupled, coarse-to-fine two-stage parsing strategy that separates global layout analysis from local content recognition, enabling both high accuracy and computational efficiency.

Model Architecture and Two-Stage Parsing

MinerU2.5's architecture comprises a Qwen2-Instruct LLM (0.5B params), a NaViT vision encoder (675M params) supporting dynamic resolutions and aspect ratios, and a patch merger utilizing pixel-unshuffle for efficient token aggregation. The model leverages M-RoPE for improved positional encoding generalization.

The two-stage parsing strategy is as follows:

  • Stage I: Layout Analysis. The input document is downsampled to 1036×1036 pixels for rapid, global layout analysis. This stage identifies structural elements (text blocks, tables, formulas, images) with minimal computational cost.
  • Stage II: Content Recognition. Guided by the detected layout, the model crops native-resolution regions from the original document and performs fine-grained recognition (text, table, formula) within these local windows. Crops are limited to 2048×28×28 pixels to balance detail preservation and efficiency.

    Figure 1: MinerU2.5's two-stage framework: global layout analysis on downsampled images, followed by fine-grained content recognition on native-resolution crops.

This decoupling reduces the O(N²) token complexity of end-to-end approaches, mitigates hallucination, and allows independent optimization of each stage.
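The coarse-to-fine control flow described above can be sketched as follows. Everything here is a stand-in: the helper names (`analyze_layout`, `recognize_crop`), the fixed regions, and the page record are illustrative assumptions, not the paper's implementation; only the ~1036-pixel thumbnail pass and the 2048×28×28 crop budget come from the text.

```python
from dataclasses import dataclass

# Hypothetical region record produced by Stage I layout analysis.
@dataclass
class Region:
    box: tuple     # (x0, y0, x1, y1), normalized to [0, 1]
    category: str  # e.g. "text", "table", "formula", "image"
    order: int     # predicted reading order

def analyze_layout(page):
    """Stage I stand-in: the real model runs on a ~1036x1036 downsampled
    thumbnail; here we simply return fixed regions for illustration."""
    return [Region((0.1, 0.1, 0.9, 0.3), "text", 0),
            Region((0.1, 0.4, 0.9, 0.8), "table", 1)]

def recognize_crop(page, region):
    """Stage II stand-in: crop at native resolution, capping the crop at
    the 2048 x 28 x 28 pixel budget mentioned in the paper."""
    x0, y0, x1, y1 = region.box
    w = int((x1 - x0) * page["width"])
    h = int((y1 - y0) * page["height"])
    budget = 2048 * 28 * 28
    if w * h > budget:                       # downscale oversized crops
        scale = (budget / (w * h)) ** 0.5
        w, h = int(w * scale), int(h * scale)
    return {"category": region.category, "crop_size": (w, h)}

def parse_page(page):
    regions = analyze_layout(page)            # coarse global pass
    regions.sort(key=lambda r: r.order)       # follow reading order
    return [recognize_crop(page, r) for r in regions]  # fine local pass

results = parse_page({"width": 2480, "height": 3508})  # A4 scan @ 300 dpi
print(results)
```

Note how only the detected regions are ever processed at native resolution; margins and whitespace never generate visual tokens.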

Data Engine and Training Pipeline

MinerU2.5's performance is underpinned by a comprehensive data engine that generates large-scale, high-quality training corpora. The pipeline consists of:

  1. Data Curation: Stratified sampling ensures diversity in layout, document type, element balance, and language.
  2. Pre-training Data Preparation: Automated annotation is refined using expert models for text, tables, and formulas.
  3. Fine-tuning Dataset Construction: The IMIC (Iterative Mining via Inference Consistency) strategy identifies hard cases via stochastic inference consistency metrics (PageIoU, TEDS, CDM), which are then manually curated.

    Figure 2: Data engine workflow: curation, automated annotation refinement, and IMIC-driven hard case mining for fine-tuning.

Task Reformulation and Metric Innovations

Layout Analysis

MinerU2.5 introduces a hierarchical tagging system for layout elements, ensuring comprehensive coverage (including headers, footers, page numbers) and fine granularity (distinct tags for code, references, lists). Layout analysis is reformulated as a multi-task problem, jointly predicting position, class, rotation, and reading order.

The PageIoU metric is proposed for layout evaluation, measuring page-level coverage and aligning quantitative scores with qualitative human assessment.

Figure 3: PageIoU metric: page-level coverage aligns with visual inspection, resolving inconsistencies in IoU-based recall.
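Since the paper's exact PageIoU formula is not reproduced here, the sketch below illustrates the page-level-coverage idea with a simple rasterized approximation; the grid resolution and the coverage-IoU computation are assumptions, not the published metric.

```python
def page_coverage_iou(pred_boxes, gt_boxes, grid=200):
    """Hypothetical PageIoU sketch: rasterize predicted and ground-truth
    layout boxes onto a coarse page grid and compare covered cells.
    Boxes are (x0, y0, x1, y1) normalized to [0, 1]."""
    def rasterize(boxes):
        cells = set()
        for x0, y0, x1, y1 in boxes:
            for i in range(int(y0 * grid), int(y1 * grid)):
                for j in range(int(x0 * grid), int(x1 * grid)):
                    cells.add((i, j))
        return cells
    pred, gt = rasterize(pred_boxes), rasterize(gt_boxes)
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)

# One big ground-truth block predicted as two half-height boxes: strict
# box matching would count both predictions as misses at high IoU
# thresholds, but the page-level coverage is identical, so the score is 1.0.
gt = [(0.1, 0.1, 0.9, 0.5)]
pred = [(0.1, 0.1, 0.9, 0.3), (0.1, 0.3, 0.9, 0.5)]
print(page_coverage_iou(pred, gt))
```

The toy example shows why a coverage-based score tracks human judgment better than box-level IoU recall when detectors split or merge otherwise-correct regions.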

Formula Recognition

MinerU2.5 employs the ADR (Atomic Decomposition and Recombination) framework, decomposing compound formulas into atomic lines, recognizing each line into LaTeX, and structurally recombining the results. This approach addresses the limitations of monolithic formula recognition and reduces structural hallucination.

Figure 4: ADR framework: compound formulas are decomposed, recognized, and recombined for high-fidelity LaTeX output.
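A minimal sketch of the decompose-recognize-recombine idea follows. The per-line recognizer stub and the `aligned`-environment recombination are stand-ins; the paper does not specify these implementation details here.

```python
def recognize_line(line_image):
    """Stand-in for the per-line recognizer; a real system would run the
    VLM on one atomic formula line and emit its LaTeX."""
    return line_image["latex"]

def adr_recognize(formula_lines):
    """Hypothetical ADR sketch: recognize each atomic line independently,
    then structurally recombine into a multi-line LaTeX environment."""
    parts = [recognize_line(line) for line in formula_lines]
    return r"\begin{aligned} " + r" \\ ".join(parts) + r" \end{aligned}"

# A two-line compound formula decomposed into atomic lines.
lines = [{"latex": r"f(x) &= (x+1)^2"},
         {"latex": r"&= x^2 + 2x + 1"}]
print(adr_recognize(lines))
```

Because each atomic line is short, the recognizer is far less likely to drop or invent structure than when decoding one long monolithic LaTeX sequence.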

Table Recognition

The model adopts a four-stage pipeline: detection, rotation correction, recognition into OTSL (Optimized Table Structure Language), and conversion to HTML. OTSL reduces structural token redundancy and sequence length, improving recognition of complex tables.

Figure 5: Table recognition pipeline: detection, geometric normalization, OTSL generation, and HTML conversion.
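The token-level idea can be illustrated with a toy converter. The token set below ('C' opens a cell, 'L' extends the previous cell's colspan, 'NL' ends a row) is a simplified stand-in for OTSL's actual vocabulary, and the converter ignores cell text and row spans.

```python
def otsl_to_html(tokens):
    """Simplified OTSL-to-HTML sketch: a flat token stream is far shorter
    than the equivalent nested HTML target the model would otherwise decode."""
    rows, row = [], []
    for tok in tokens:
        if tok == "C":
            row.append(["", 1])        # [text placeholder, colspan]
        elif tok == "L":
            row[-1][1] += 1            # widen the previous cell
        elif tok == "NL":
            rows.append(row)
            row = []
    html = ["<table>"]
    for r in rows:
        cells = "".join(
            f'<td colspan="{span}"></td>' if span > 1 else "<td></td>"
            for _, span in r)
        html.append(f"<tr>{cells}</tr>")
    html.append("</table>")
    return "".join(html)

# A 2x2 table whose first row is one merged cell spanning both columns.
print(otsl_to_html(["C", "L", "NL", "C", "C", "NL"]))
```

Six structure tokens here replace the full tag soup of the HTML output, which is the kind of sequence-length saving the pipeline exploits.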

Iterative Mining via Inference Consistency (IMIC)

IMIC leverages stochastic inference to identify hard cases (samples with low output consistency across multiple runs) using task-specific metrics. These cases are prioritized for manual annotation, focusing human effort on the most valuable data for model improvement.

Figure 6: IMIC strategy: low-consistency samples in layout, table, and formula recognition are mined for targeted annotation.
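A minimal sketch of consistency-based mining, with hypothetical function names and a toy exact-match metric standing in for the paper's PageIoU/TEDS/CDM:

```python
import random
import statistics

def imic_mine(samples, run_model, metric, runs=5, threshold=0.9):
    """Hypothetical IMIC sketch: run stochastic inference several times per
    sample, score pairwise output consistency with a task metric, and flag
    low-consistency samples as hard cases worth manual annotation."""
    hard = []
    for sample in samples:
        outputs = [run_model(sample) for _ in range(runs)]
        scores = [metric(outputs[i], outputs[j])
                  for i in range(runs) for j in range(i + 1, runs)]
        if statistics.mean(scores) < threshold:
            hard.append(sample)
    return hard

# Toy demo: "a" is parsed identically every run; "b" yields unstable
# outputs, so it is mined as a hard case.
random.seed(0)
def toy_model(s):
    return s if s == "a" else s + str(random.randint(0, 9))
def toy_metric(x, y):
    return 1.0 if x == y else 0.0
print(imic_mine(["a", "b"], toy_model, toy_metric))  # ['b']
```

In practice the consistency score would use the same task metric later used for evaluation, so "hardness" directly reflects the errors the model actually makes.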

Quantitative Evaluation

MinerU2.5 achieves state-of-the-art results across multiple benchmarks:

  • OmniDocBench: Overall score of 90.67, outperforming MonkeyOCR-pro-3B and dots.ocr. Best-in-class edit distance (0.047), formula CDM (88.46), table TEDS (88.22), and reading order edit distance (0.044).
  • Ocean-OCR: Lowest edit distance (0.033) and highest F1-score (0.945) on English documents; highest F1-score (0.965) and precision (0.966) on Chinese documents.
  • olmOCR-bench: Overall score of 75.2, leading in arXiv Math (76.6), Old Scans Math (54.6), and Long Tiny Text (83.5).
  • Layout Analysis: Top Full Page F1-score@PageIoU across OmniDocBench, D4LA, and DocLayNet.
  • Table Recognition: SOTA on FinTabNet, OCRBench v2, and in-house benchmarks; competitive on PubTabNet and CC-OCR.
  • Formula Recognition: SOTA on SCE, LaTeX-80M, Fuzzy Math, and Complex datasets; second-best on CPE, HWE, SPE, and Chinese.

    Figure 7: MinerU2.5's performance highlights on OmniDocBench: consistently surpasses general-purpose and domain-specific VLMs in text, formula, table, and reading order tasks.

Qualitative Analysis

MinerU2.5 demonstrates robust parsing across diverse document types (academic literature, books, reports, slides, newspapers, magazines), complex tables (rotated, merged cells, borderless, colored, dense), and intricate formulas (multi-line, nested, mixed-language, degraded).

Figure 8: Layout and markdown output for academic literature, books, and textbooks.


Figure 9: Rendered outputs for various types of tables.


Figure 10: Rendered outputs for various types of formulas.

Comparisons with previous versions and other SOTA models show MinerU2.5's superior handling of rotated tables, merged cells, Chinese formulas, multi-line formulas, finer layout detection, and watermarked pages.

Figure 11: MinerU2.5 outperforms previous versions in rotated table recognition.


Figure 12: MinerU2.5 excels in formula recognition with Chinese content compared to prior models.

Deployment and Efficiency

MinerU2.5 is deployed via an optimized vLLM-based pipeline, with an asynchronous backend and decoupled inference stages to minimize latency. Dynamic sampling penalties are applied based on detected layout types to suppress degenerate repetition. The model achieves 2.12 pages/s and 2337.25 tokens/s on an A100 80G, outperforming MonkeyOCR-Pro-3B by 4× and dots.ocr by 7× in throughput.

Implications and Future Directions

MinerU2.5's decoupled architecture and data-centric training pipeline establish a new paradigm for efficient, high-fidelity document parsing. Its ability to rapidly convert unstructured documents into structured data is critical for curating pre-training corpora and enhancing RAG systems. The preservation of semantic integrity in tables, formulas, and layouts positions MinerU2.5 as a foundational tool for next-generation AI applications in knowledge extraction, information retrieval, and multimodal understanding.

Future work may explore further scaling of the data engine, integration with more advanced layout and semantic tagging systems, and adaptation to multilingual and domain-specific document types. The IMIC strategy offers a blueprint for continual model improvement via targeted annotation, potentially generalizable to other structured data extraction tasks.

Conclusion

MinerU2.5 demonstrates that a decoupled, coarse-to-fine VLM architecture, combined with a rigorous data engine and targeted hard case mining, can achieve state-of-the-art document parsing accuracy and efficiency. Its design resolves key trade-offs in high-resolution document understanding and sets a new standard for practical deployment in large-scale, real-world scenarios. The model's contributions extend beyond document parsing, offering substantial benefits for downstream AI systems reliant on structured document data.


Explain it Like I'm 14

Overview

This paper introduces MinerU2.5, a smart computer system that can read and understand complex documents like PDFs and scanned pages. It focuses on “document parsing,” which means finding different parts of a page (like titles, paragraphs, tables, and formulas) and turning them into clean, editable text and structure. MinerU2.5 aims to be both very accurate and very fast, even on big, high-resolution pages with lots of content.

Key Questions the Paper Tries to Answer

Here are the main problems the researchers wanted to solve:

  • How can we read high-resolution, busy pages quickly without missing tiny details?
  • How can we avoid “hallucinations,” where the model makes up content that isn’t there?
  • How can we handle tricky parts of documents—like rotated tables, long multi-line formulas, and mixed languages—reliably?
  • How can we build a model that’s accurate but still small enough to run efficiently?

How MinerU2.5 Works

Think of parsing a page like exploring a giant picture:

  • Stage I is like looking at a zoomed-out map: the model analyzes a smaller version of the page to find the layout—where the titles, text blocks, tables, images, and formulas are located, and the order you should read them in.
  • Stage II is like using a magnifying glass: the model zooms into each important area at the original high resolution and reads the content carefully (text, formulas, tables), keeping tiny details intact.

The Model’s Parts (in simple terms)

  • Vision-language model: A program that understands both images and text together.
  • Vision encoder: The “eyes” that turn picture patches into numbers the model can understand. MinerU2.5 uses NaViT, which can handle images of many sizes without squishing or warping them.
  • LLM: The “brain” that turns those numbers into words, LaTeX (for math), or table structures.
  • Patch merger: A smart compressor that groups nearby image patches so the system runs faster without losing important detail.

Why Two Stages?

Big pages have lots of empty space (margins, gaps) that waste time if you process the whole image at full resolution. The two-stage approach:

  • Saves time by only zooming in where needed.
  • Keeps the global “big picture” of how the page is organized.
  • Reduces mistakes and hallucinations.
  • Makes the system easier to improve, because layout and reading can be tuned separately.

Training and Data: How They Made It Good

The team built a careful “data engine” to train MinerU2.5:

  • Data curation: They selected a diverse set of documents in English and Chinese, balancing different layouts and types (papers, reports, books).
  • Pre-training: They used large amounts of automatically labeled data—then improved those labels using strong expert models for text, tables, and formulas.
  • Fine-tuning: They focused on “hard cases” (especially tricky pages) found using a method called IMIC. IMIC runs the model multiple times; if its answers don’t agree, that sample is likely challenging and worth careful human annotation.

Special Upgrades for Tough Tasks

MinerU2.5 doesn’t just do basic OCR. It has smart designs for specific hard problems:

  • Layout analysis: It predicts position, category, rotation angle, and reading order together in one go. It uses a unified tagging system that includes headers, footers, page numbers, references, lists, code blocks, captions, and more—so the page is well-structured for downstream use.
  • PageIoU metric: Instead of only using box-matching scores, PageIoU measures how well the predicted areas cover the page like a colored overlay—this aligns better with what humans consider “good” layout detection.
  • Formula recognition (ADR framework): It breaks long, multi-line formulas into smaller “atomic” pieces, recognizes each piece, then stitches them back together. This reduces confusion and keeps complex math correct.
  • Table recognition (OTSL): Instead of generating complex HTML directly, it first creates a simpler table-language called OTSL, which uses fewer tokens and matches table structure more directly. Then it converts OTSL to HTML.

Main Results and Why They Matter

MinerU2.5 achieves state-of-the-art accuracy on multiple benchmarks, including OmniDocBench. It beats large, general models (like Gemini and GPT-4o) and specialized OCR models on:

  • Text recognition
  • Formula recognition
  • Table recognition
  • Reading order prediction

It is also fast. Even with a relatively small size (about 1.2 billion parameters), it processes pages at a high speed and uses fewer computing resources compared to bigger models. On standard hardware, it reads more pages per second than several popular systems.

This means MinerU2.5 is both powerful and practical.

What This Means for the Real World

  • Faster digitization: It can turn PDFs and scans into clean, structured text and data more quickly.
  • Better search and RAG: Because it organizes the page reliably (including headers, footers, captions, references), systems that use documents to answer questions can find better information.
  • Reliable math and tables: Complex formulas and tricky tables become more accurate and easier to reuse or analyze.
  • Lower costs: Because it runs efficiently, organizations don’t need huge servers to get great results.
  • Fewer mistakes: The decoupled approach reduces hallucinations and makes results more trustworthy.

In short, MinerU2.5 shows that smart design—first finding the layout, then reading the details—can make document understanding both accurate and fast, helping schools, researchers, and businesses work with complex documents more easily.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, formulated to be actionable for future research:

  • Two-stage sensitivity and error propagation: No quantitative analysis of how Stage I layout errors (mislocalized boxes, wrong class, rotation, or reading order) degrade Stage II recognition; no uncertainty propagation, confidence calibration, or fallback strategies (e.g., auto-escalation to bigger crops or whole-page parsing).
  • Thumbnail and crop hyperparameters: The fixed Stage I thumbnail size (1036×1036) and Stage II crop upper bound (“2048 × 28 × 28”) are justified qualitatively but lack systematic tuning and sensitivity studies across diverse page sizes, densities, and layouts.
  • Ablations of architectural choices: Missing controlled ablations for key design decisions (NaViT vs. window attention; M-RoPE vs. 1D-RoPE; 2×2 pixel-unshuffle patch merger; dynamic repetition penalties; asynchronous scheduling) to quantify their individual impact on accuracy, speed, and stability.
  • Efficiency–accuracy trade-offs: No explicit scaling laws or curves showing token reduction vs. accuracy across page complexity, number of crops, and crop sizes; unclear if the approach still becomes impractical for extremely dense or long documents.
  • Multi-page and cross-page structures: Reading order and layout analysis appear page-bound; open question on handling cross-page elements (tables continued across pages, figure–caption separation, multi-page references), and document-level reading order.
  • Language coverage beyond Chinese–English: Unclear robustness to non-Latin scripts (Arabic/Hebrew RTL, Devanagari, Thai), vertical writing, mixed scripts, and low-resource languages; no multilingual benchmarks reported.
  • Handwritten and cursive content: No evaluation or strategy for handwritten notes, annotations, or mixed print–handwriting documents common in reports, forms, and archival scans.
  • Figure, chart, and chemical-structure content: While these categories exist in the layout tags, there is no evaluation or pipeline for content extraction (e.g., chart data recovery, diagram-to-structured representation, caption–figure linking).
  • Table cell text OCR quality: Structure recognition (OTSL→HTML) is addressed, but end-to-end evaluation of cell text accuracy, cell alignment semantics, header hierarchies, merged cells, multi-line cell content, and numeric formatting is missing.
  • OTSL→HTML fidelity and edge cases: No quantitative assessment of conversion correctness for complex spanning patterns, nested headers, nested tables, footnotes anchored to cells, or rotated/vertical text within cells.
  • Formula ADR recombination fidelity: No evaluation of whether line-wise LaTeX recombination preserves mathematical semantics (alignment anchors, equation numbering/tags, environments, interline references, multi-column equations).
  • Metrics coverage and validation: PageIoU is proposed but lacks broader validation (correlation with human judgements, sensitivity to box granularity, class-specific analyses) and comparison to alternative layout metrics; formula CDM and table TEDS usage lacks explicit error taxonomies.
  • Reading order evaluation: The paper claims improvements but does not detail standardized metrics, benchmarks, or error analyses for reading order quality (especially in multi-column, figure–caption interleaving, or list/algorithm blocks).
  • Hallucination mitigation evidence: Claims of reduced hallucination are not backed by quantitative measures (e.g., hallucination rate per task, qualitative taxonomy, correlation with confidence estimates).
  • Robustness to real-world degradations: Although augmentations are defined, there is no stress testing under real-world conditions (camera-captured pages, skew, warping, folds, extreme blur/noise, glossy reflections) nor ablations on augmentation contributions; notably, spatial transforms are excluded for layout training without empirical justification.
  • IMIC hard-case mining validation: The stochastic consistency criterion may select noisy/ambiguous samples; there is no study of threshold selection, precision/recall of “hardness,” diversity of mined cases, or measured gains vs. random sampling; risk of overfitting to idiosyncratic ambiguities remains.
  • Data engine bias and provenance: Heavy reliance on proprietary/foundation models (Qwen2.5-VL-72B, Gemini-2.5 Pro) for refinement introduces label bias; no quantification of label noise, inter-model disagreement, or downstream bias; limited transparency on licensing, privacy, and release of training data for reproducibility.
  • Generalization across domains: Evaluation is largely on OmniDocBench; missing tests on business documents (invoices, receipts, forms), legal contracts, medical records, financial statements, historical archives, educational exams, and presentation slides in diverse formats.
  • Deployment on constrained hardware: Throughput is evaluated on high-end GPUs (A100/H200/4090 48G); no results on consumer GPUs with 8–24GB VRAM, CPU-only environments, mobile/edge devices, memory footprint, or energy consumption.
  • End-to-end latency accounting: vLLM startup overhead is excluded and pipeline latency breakdowns (cropping, rotation correction, I/O, scheduling) are not reported; lack of tail-latency and batch-size sensitivity analyses for production SLAs.
  • Confidence and quality control: No per-element confidence scores, calibration, or QA hooks to drive human-in-the-loop review, selective re-parsing, or automated rejection for low-confidence outputs.
  • Security and privacy: Use of commercially procured documents and human annotation raises unanswered questions on PII handling, anonymization, access controls, and compliance; absent threat-model or privacy-preserving training/inference strategies.
  • Downstream task impact: Claims of benefits for RAG and intelligent document analysis lack empirical demonstrations (e.g., RAG answer accuracy, retrieval recall, auditability) with parsed outputs vs. baselines.
  • Standardization of tags and interoperability: The proposed tagging system’s interoperability with existing schemas (e.g., DocAI, TEI, PDF standards) is not explored; migration pathways and tool support for broader adoption remain open.
  • Error taxonomy and qualitative analysis: No systematic breakdown of failure modes per task (layout, table, formula, text), their frequencies, and targeted mitigation strategies informed by error analyses.
  • Reproducibility of training recipe: Key training datasets (6.9M pre-training samples, 630K fine-tuning) and curation details are not fully public; open question on replicability of results with accessible data and models.

These gaps suggest concrete directions for future work, including rigorous ablations, multilingual and multi-page evaluations, comprehensive end-to-end metrics, confidence-calibrated pipelines, and broader domain coverage under realistic deployment constraints.

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, supported by the paper’s demonstrated performance, open-source release, and deployment details.

  • Enterprise document ingestion and RAG pipelines (software, finance, legal, insurance)
    • What: High-throughput parsing of PDFs and scans into structured JSON/HTML, including reading order, tables (HTML/CSV via OTSL), and LaTeX for formulas; ingestion into vector databases and knowledge graphs.
    • Why now: Proven state-of-the-art accuracy and 2–4× throughput gains vs. strong baselines; decoupled layout→content recognition reduces hallucinations and token waste in long, high-res documents.
    • Tools/workflows: MinerU2.5 service (vLLM backend), ETL jobs, REST API for batch pages, table OTSL→HTML/CSV converter, formula LaTeX generator; dynamic sampling penalties tuned by element type as described.
    • Assumptions/dependencies: GPU inference (e.g., RTX 4090 48G, A100/H200) for best throughput; strongest results in Chinese/English; quality depends on input scan resolution; licensing of model/data must align with enterprise policies.
  • Contract analytics and compliance review (legal, finance, procurement)
    • What: Preserve headers/footers/page numbers and reading order for accurate clause extraction, cross-references, and redaction workflows.
    • Why now: The unified tagging system captures non-body elements often missed by pipelines; two-stage design improves reliability on dense, multi-column layouts.
    • Tools/workflows: Clause extraction + RAG; layout-aware redaction tool; audit trails from Stage I/II outputs.
    • Assumptions: Domain adaptation may be needed for non-standard forms/handwriting; privacy governance for on-prem deployment.
  • Scientific publishing and archive conversion (academia, publishing, open science)
    • What: PDF-to-LaTeX/HTML conversion with high-fidelity formula extraction (ADR), robust table parsing (OTSL), and clean reading order for long/complex papers.
    • Why now: Demonstrated breakthroughs in formula parsing and table recognition (rotated/borderless/partial borders).
    • Tools/workflows: Automated arXiv/publisher backfile conversion; LaTeX equation recovery; structured references and lists extraction.
    • Assumptions: Images/figures still require specialized figure/diagram understanding if semantic extraction is needed.
  • Accessibility enhancement for screen readers (education, public sector, publishing)
    • What: Generate linearized HTML with correct reading order, identify captions/footnotes, and convert formulae to MathML (via LaTeX) to improve assistive technologies.
    • Why now: Decoupled layout analysis yields reliable reading order and non-body element capture.
    • Tools/workflows: PDF→Accessible HTML pipeline; ARIA tagging postprocessor; alt-text LLM integration optional.
    • Assumptions: MathML generation step required (LaTeX→MathML converter); accessibility review still recommended.
  • Financial operations and back-office automation (finance, logistics, retail)
    • What: Structured extraction from invoices, statements, manifests, and catalogs, including robust tables even when rotated or borderless.
    • Why now: OTSL-based parsing halves sequence length vs HTML targets, improving accuracy and speed for large/complex tables.
    • Tools/workflows: OCR microservice returning JSON+CSV; confidence scoring via sampling consistency; exception routing.
    • Assumptions: Domain layouts vary; limited handwriting support; ensure currency/locale normalization.
  • E-discovery and records management (legal, public sector)
    • What: Bulk parsing of scanned records preserving headers/footers/page numbers; reading order reconstruction for accurate downstream search and summarization.
    • Why now: High page throughput enables timely case processing; PageIoU metric can monitor layout coverage quality at-scale.
    • Tools/workflows: Batch ingestion, PageIoU-based QC dashboards, legal hold pipelines.
    • Assumptions: Severe degradations (fax/noise) may require pre-cleaning; privacy and chain-of-custody constraints.
  • Rapid enterprise QA/summarization on policy and procedure documents (software, HR, operations)
    • What: Low-latency extraction of structured sections and lists to feed policy chatbots and compliance checks.
    • Why now: vLLM optimizations and decoupled stages reduce end-to-end latency and hallucination risk.
    • Tools/workflows: Section-aware summarizers; retrieval built on structured segments; per-element sampling penalties as in deployment section.
    • Assumptions: Guardrails still needed for downstream generation.
  • Localization/translation with layout preservation (localization, publishing)
    • What: Extract text by blocks with layout metadata and reflow post-translation while preserving math (via LaTeX) and table structures (OTSL/HTML).
    • Why now: Fine-grained tagging (captions, lists, code, algorithms) supports targeted translation policies.
    • Tools/workflows: CAT/MT pipeline with structure-preserving post-editing; formula passthrough.
    • Assumptions: Non-Latin scripts beyond Chinese/English may need adaptation; QA for typography/line-breaking.
  • Data curation and model improvement loops (ML ops in academia/industry)
    • What: Use IMIC to mine “hard cases” via inference consistency and prioritize human annotation; use Dingo-like QA tooling for efficient review.
    • Why now: IMIC is directly applicable with MinerU2.5 outputs; improves dataset quality without exhaustive labeling.
    • Tools/workflows: Active-learning pipeline with PageIoU (layout), TEDS (tables), CDM (formulas).
    • Assumptions: Annotation workforce; governance over source data; storage for intermediate artifacts.
  • Benchmarking and evaluation modernization (academia, standards bodies)
    • What: Adopt PageIoU to better align layout evaluation with human judgment at page level.
    • Why now: Clear mismatch of IoU-based metrics shown; PageIoU is simple to compute from coverage maps.
    • Tools/workflows: Replace/augment mAP dashboards; evaluate reading-order and rotation jointly per the multi-task paradigm.
    • Assumptions: Community/organization buy-in; metric standardization process.
  • Developer products and APIs (software)
    • What: Offer a MinerU2.5-powered “Document to Structured JSON” API; microservices for Table (OTSL→HTML/CSV) and Formula (LaTeX via ADR).
    • Why now: Open model and code available; straightforward deployment via vLLM with suggested scheduling parameters.
    • Tools/workflows: Containerized services; SDKs for Python/JS; batch and streaming endpoints.
    • Assumptions: GPU availability; cost modeling for throughput tiers.
  • Personal knowledge management and study aids (daily life, education)
    • What: Convert lecture PDFs/notes into searchable, structured notes; extract LaTeX for practice; clean tables to CSV for analysis.
    • Why now: Efficient on single-GPU workstations; robust on dense academic PDFs.
    • Tools/workflows: Desktop app with local inference; Obsidian/Notion importers.
    • Assumptions: Consumer GPUs for best UX; privacy-sensitive content stays local.

Long-Term Applications

These require further research, scaling to additional domains/languages, or engineering for broader deployment footprints.

  • Multilingual, handwriting, and low-resource scripts expansion (education, public sector, global enterprises)
    • What: Extend MinerU2.5 to diverse scripts (Arabic, Devanagari, Cyrillic, etc.) and handwriting-heavy documents.
    • Dependencies: Additional data curation via IMIC, targeted augmentation, and fine-tuning; potential specialized handwriting modules.
  • Mobile and edge deployment (software, field operations, logistics)
    • What: On-device parsing for mobile scanning apps and edge appliances.
    • Dependencies: Aggressive quantization, distillation of the 1.2B model, optimized runtimes (WASM/NNAPI), CPU/ARM kernels; memory-aware crop orchestration.
  • Domain-specific parsing of technical drawings and hybrid pages (engineering, energy, manufacturing)
    • What: Integrate layout+content parsing with figure/diagram understanding (CAD schematics, P&IDs, charts).
    • Dependencies: Additional vision modules for vector graphics/diagrams; expanded tagging schema and metrics.
  • Healthcare document understanding with regulatory compliance (healthcare)
    • What: Robust parsing for scanned medical records, lab reports, and guidelines; structure-preserving de-identification.
    • Dependencies: HIPAA/GDPR-compliant on-prem deployments; strong noise/scan robustness; medical vocabularies and forms; human-in-the-loop validation.
  • Chemistry and materials literature mining (pharma, materials)
    • What: Combine chemical_structure tags with formula and table parsing for reaction conditions and property tables.
    • Dependencies: Chemical OCR/diagram recognition integration; entity normalization; domain ontologies.
  • Standardization of evaluation metrics and schemas (policy, standards bodies)
    • What: Institutionalize PageIoU and unified layout tagging in procurement and compliance benchmarks.
    • Dependencies: Community consensus, reference implementations, inter-annotator agreement studies.
  • Autonomous document process orchestration (RPA + structured OCR)
    • What: End-to-end workflows that trigger actions (filing, approvals, payments) based on parsed structure and content.
    • Dependencies: Robust error detection via IMIC-like uncertainty; guardrails; integration with BPM/RPA systems.
  • Math-aware knowledge bases and search (academia, edtech)
    • What: Build large-scale, clean math corpora with faithful LaTeX and aligned text for math-aware retrieval and tutoring.
    • Dependencies: ADR scaled to very long derivations; LaTeX→semantic MathML conversion; evaluation datasets.
  • Privacy-preserving collaborative data engines (ML ops)
    • What: Federated IMIC pipelines to mine hard cases without centralizing sensitive documents.
    • Dependencies: Federated learning infrastructure; privacy-preserving metrics/aggregation; policy alignment.
  • Increased robustness to extreme degradations and layout anomalies (archives, government)
    • What: Handle torn pages, bleed-through, heavy skew, marginalia, stamps.
    • Dependencies: Augmentation and specialized pre-processing; adaptive crop strategies; uncertainty-aware fallbacks.
  • Cost-optimized cloud services at scale (software)
    • What: Multi-tenant services with dynamic batching, elastic scaling, and SLA tiers.
    • Dependencies: Scheduler advances beyond current vLLM tuning; autoscaling logic; cost/performance simulators.

Cross-Cutting Assumptions and Dependencies

  • Compute: Best-in-class throughput achieved on high-memory GPUs (4090 48G, A100/H200). CPU-only or low-VRAM scenarios need optimization.
  • Languages: Strongest results in Chinese/English; other languages and handwriting need data and fine-tuning.
  • Input quality: Very low-quality scans may require pre-processing (deskew, denoise).
  • Governance: Data licenses, PII handling, and regulatory compliance must be addressed for production.
  • Integration: Downstream systems benefit from the two-stage outputs (layout JSON + per-crop recognition), but require adapters into existing ETL, RPA, and search stacks.
  • Reliability: While the decoupled approach reduces hallucinations, production systems should incorporate confidence thresholds, IMIC-driven active learning, and human review for high-stakes use.
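The IMIC-style review loop mentioned above can be sketched as a simple consistency gate: run inference several times with stochastic sampling, score pairwise agreement, and route low-agreement pages to human review or the mining pool. The `infer` callable, the string-similarity proxy, and the threshold are illustrative assumptions, not the paper's implementation (which uses pairwise CDM for formulas):

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(outputs):
    """Mean pairwise string similarity across stochastic inference runs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def gate(infer, image, runs=3, threshold=0.9):
    # `infer` is any stochastic parsing function returning a string output.
    outputs = [infer(image) for _ in range(runs)]
    score = consistency_score(outputs)
    # Low agreement -> flag for human review / hard-case mining.
    return {"output": outputs[0], "consistent": score >= threshold, "score": score}
```

Swapping `SequenceMatcher` for a task-appropriate metric (TEDS for tables, CDM for formulas) would align this gate with the evaluation metrics the paper already uses.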

Glossary

  • 1D-RoPE: One-dimensional Rotary Position Embedding used to encode token positions in sequence models. "replace the original 1D-RoPE~\cite{su2024roformer} with M-RoPE~\cite{wang2024qwen2}"
  • 2D-RoPE: Two-dimensional Rotary Position Embedding for encoding spatial positions in image tokens. "and employs 2D-RoPE for positional encoding"
  • ADR (Atomic Decomposition and Recombination): A multi-stage formula recognition framework that decomposes compound formulas into atomic lines, recognizes them, and recombines the results structurally. "The Atomic Decomposition and Recombination (ADR) Framework."
  • Atomic Formulas: Indivisible semantic units of mathematical expressions that form the basic components of compound formulas. "Atomic Formulas: The smallest, indivisible semantic units with a tight 2D topology (e.g., a single fraction, a matrix)."
  • CDM (Character Detection Matching): A formula recognition metric that compares rendered formulas at the character level; here it is applied pairwise across stochastic runs to assess output consistency. "consistency is assessed using the pairwise CDM~\cite{wang2025image}"
  • Compound Formulas: Multi-line or composed mathematical expressions made from ordered atomic formulas. "Compound Formulas: An ordered set of atomic formulas composed vertically with specific alignment relationships (e.g., a multi-line derivation aligned at the equal signs)."
  • frequency_penalty: A decoding parameter that reduces repetition by penalizing tokens proportional to their frequency in the output. "we dynamically adjust sampling parameters like frequency_penalty and presence_penalty in Stage~II"
  • IMIC (Iterative Mining via Inference Consistency): A strategy to mine hard samples by measuring the consistency of multiple stochastic inference outputs. "we introduce the IMIC (Iterative Mining via Inference Consistency) strategy."
  • mAP (mean Average Precision): A standard object detection metric summarizing precision-recall performance across classes and thresholds. "Layout analysis is typically evaluated with object detection metrics like mAP"
  • M-RoPE: A modified Rotary Position Embedding variant enhancing generalization for varying resolutions/aspect ratios in vision-LLMs. "replace the original 1D-RoPE~\cite{su2024roformer} with M-RoPE~\cite{wang2024qwen2}"
  • NaViT: A Vision Transformer variant that supports dynamic, native image resolutions and aspect ratios. "we employ a 675M-parameter NaViT~\cite{dehghani2023patch} initialized from Qwen2-VL."
  • Native-resolution approaches: Methods that process images at their original resolution to preserve fine-grained details. "Native-resolution approaches~\cite{bai2025qwen2,guo2025seed1,dots.ocr,niu2025native} preserve fine-grained details"
  • OCR 2.0: A unified OCR paradigm that jointly handles text, formulas, tables, and charts within one framework. "pioneered the OCR 2.0 paradigm"
  • OTSL (Optimized Table-Structure Language): A compact intermediate representation for table structures that reduces token redundancy and sequence length. "we leverage the Optimized Table-Structure Language (OTSL)~\cite{lysak2023optimized}, an intermediate representation"
  • PageIoU: A page-level coverage metric measuring spatial consistency between predicted and ground-truth layouts. "we introduce PageIoU, a page-level coverage metric"
  • Patch Merger: A component that aggregates vision tokens (often with an MLP adaptor) to reduce token counts before feeding them to the LLM. "Only the two-layer MLP within the patch merger is trained"
  • Pixel-unshuffle: An operation that rearranges pixels to merge neighboring patches, lowering spatial resolution while increasing channel depth. "uses pixel-unshuffle~\cite{shi2016real} on adjacent 2 × 2 vision tokens"
  • presence_penalty: A decoding parameter that discourages reuse of previously generated tokens to increase output diversity. "we dynamically adjust sampling parameters like frequency_penalty and presence_penalty in Stage~II"
  • RAG (Retrieval-Augmented Generation): A technique where external retrieved documents augment a model’s context to improve generation. "Retrieval-Augmented Generation (RAG)~\cite{lin2024revolutionizing,zhang2024ocr,zhao2024retrieval}"
  • Reading order prediction: Determining the sequence in which document elements should be read to preserve semantic flow. "reading order prediction"
  • Swin-Transformer: A hierarchical vision transformer using shifted windows for efficient local attention. "Dolphin \citep{feng2025dolphin} employs a Swin-Transformer VLM"
  • TEDS (Tree-Edit-Distance-based Similarity): A similarity metric for table structure outputs based on tree edit distance. "we use the TEDS (Tree-Edit-Distance-based Similarity) score"
  • Token redundancy: Excess visual tokens from low-information regions that inflate computation without improving accuracy. "token redundancy, arising from large blank or low-information regions within the document image."
  • vLLM: A high-throughput inference and serving engine for LLMs. "We implement an efficient offline inference pipeline for MinerU2.5 based on vLLM~\citep{kwon2023efficient}."
  • Visual Instruction Tuning: Fine-tuning a model to follow natural-language instructions for visual tasks. "Visual Instruction Tuning."
  • Visual Question Answering (VQA): A task requiring models to answer questions about images, used here for modality alignment. "modality alignment training on Visual Question Answering (VQA) datasets."
  • Window attention: An attention mechanism restricted to local windows to improve efficiency on high-resolution inputs. "adopts window attention to improve efficiency"
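To make the pixel-unshuffle entry above concrete, the following pure-Python sketch merges each 2×2 neighborhood of a token grid into one token with 4× the channel depth. Shapes and values are illustrative assumptions; in the model this happens inside the patch merger before the two-layer MLP:

```python
# Illustrative pixel-unshuffle on a grid of vision tokens: each r x r
# neighborhood is concatenated into one token, quartering (for r=2) the
# token count passed to the LLM while preserving all channel information.
def pixel_unshuffle(grid, r=2):
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):       # concatenate the r*r neighbors'
                for dj in range(r):   # channel vectors in raster order
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out
```

For example, a 4×4 grid of 8-dimensional tokens becomes a 2×2 grid of 32-dimensional tokens.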

Open Problems

We found no open problems mentioned in this paper.
