DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

Published 17 Jun 2024 in cs.CV | (2406.11633v2)

Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four key characteristics: 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their LaTeX source codes. 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA. 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.

Summary

The paper presents DocGenome, a novel benchmark of 500K scientific documents across 153 disciplines for evaluating multi-modal language models.
The paper introduces DocParser, an automated four-stage pipeline that annotates document structures with high accuracy and efficiency.
Experiments evaluate models such as GPT-4V on tasks including classification and QA, and show that training on more DocGenome data improves layout detection and document transformation.

Introducing DocGenome Dataset

The paper presents DocGenome, a large multi-modal scientific document benchmark designed for training and evaluating Multi-modal Large Language Models (MLLMs). DocGenome is constructed by auto-labeling 500K scientific documents from the arXiv community, spanning 153 disciplines. The benchmark stands out for its comprehensive coverage of document components and of the relationships between entities, and it offers a diverse range of tasks for evaluating MLLMs.

Figure 1: Overview of the DocGenome dataset, illustrating its multi-disciplinary scope and the structured components termed the document's genome.
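
To make the idea of a document "genome" more concrete, the sketch below models one annotated page as a list of typed layout entities plus directed relations between them. It is a minimal illustration only: the class names, field names, and the label strings (`"figure"`, `"caption"`, `"refers_to"`) are assumptions made for this example, not the dataset's actual schema, which defines 13 layout attributes and 6 relation types.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One layout unit on a page (hypothetical schema)."""
    entity_id: int
    attribute: str      # placeholder label such as "figure" or "caption"
    bbox: tuple         # (x0, y0, x1, y1) in page coordinates
    latex_source: str = ""


@dataclass
class Relation:
    """A directed, typed link between two entities (hypothetical schema)."""
    source_id: int
    target_id: int
    kind: str           # placeholder relation name


@dataclass
class PageAnnotation:
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def related(self, entity_id, kind):
        """Return the entities reached from `entity_id` via relations of type `kind`."""
        by_id = {e.entity_id: e for e in self.entities}
        return [
            by_id[r.target_id]
            for r in self.relations
            if r.source_id == entity_id and r.kind == kind
        ]


# Toy page: one figure linked to its caption.
page = PageAnnotation(
    entities=[
        Entity(1, "figure", (50, 100, 300, 280), r"\includegraphics{overview}"),
        Entity(2, "caption", (50, 285, 300, 310), r"\caption{Overview.}"),
    ],
    relations=[Relation(1, 2, "refers_to")],
)
```

Storing relations as explicit edges, rather than nesting entities inside each other, keeps lookups such as "which caption belongs to this figure" a simple filter over the relation list.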

DocParser: Automated Annotation Pipeline

DocParser is central to building the DocGenome dataset. It annotates scientific documents through a four-stage pipeline: data preprocessing, unit segmentation, attribute assignment and relation retrieval, and finally color rendering. Automating the annotation is what makes it feasible to label such a large corpus accurately and at low cost.

Figure 2: Schematic representation of the DocParser pipeline outlining the sequential stages for document annotation.
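
The sequential structure of such a pipeline can be sketched as a chain of functions, each taking the document state and returning an enriched copy. This is a toy illustration, not DocParser's implementation: every function body is a stand-in, the dictionary layout and names (`run_pipeline`, `PALETTE`, the `"adjacent"` relation) are hypothetical, and grouping attribute assignment with relation retrieval into one stage is one reading of the four-stage description.

```python
import re


def preprocess(doc):
    """Stage 1 stand-in: normalize line endings and trim the raw source."""
    return {**doc, "source": doc["source"].replace("\r\n", "\n").strip()}


def segment_units(doc):
    """Stage 2 stand-in: split the source into units at blank lines."""
    units = [u.strip() for u in re.split(r"\n\s*\n", doc["source"]) if u.strip()]
    return {**doc, "units": units}


def assign_attributes_and_relations(doc):
    """Stage 3 stand-in: label each unit and link neighbouring units."""
    labels = [
        "equation" if u.startswith(r"\begin{equation}") else "text"
        for u in doc["units"]
    ]
    relations = [(i, i + 1, "adjacent") for i in range(len(labels) - 1)]
    return {**doc, "labels": labels, "relations": relations}


# Hypothetical palette: one color per label.
PALETTE = {"text": "blue", "equation": "red"}


def render_colors(doc):
    """Stage 4 stand-in: choose a color for each unit from its label."""
    return {**doc, "colors": [PALETTE[label] for label in doc["labels"]]}


STAGES = [preprocess, segment_units, assign_attributes_and_relations, render_colors]


def run_pipeline(source):
    """Run all stages in order on a raw source string."""
    doc = {"source": source}
    for stage in STAGES:
        doc = stage(doc)
    return doc


result = run_pipeline("Intro text.\n\n\\begin{equation}E = mc^2\\end{equation}\n")
```

Because each stage only reads keys produced by earlier stages, the order of `STAGES` encodes the dependencies between them.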

Benchmark Design and Analysis

DocGenome offers broad discipline coverage, logical relationships between entities, and rigorous quality control. The benchmark features seven document-oriented tasks, including document classification, visual grounding, and layout detection. Extensive experiments show that training on DocGenome improves model performance, highlighting the dataset’s utility for both training and evaluation.
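
Tasks such as layout detection and visual grounding compare predicted bounding boxes with annotated ones. A common overlap measure for this comparison is intersection over union (IoU). The sketch below is a generic implementation, assuming boxes are given as `(x0, y0, x1, y1)` tuples; the 0.5 match threshold is an illustrative assumption, not a setting taken from the benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    # Corners of the overlapping rectangle, if any.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Count a prediction as correct when its IoU reaches the threshold (assumed value)."""
    return iou(pred_box, gt_box) >= threshold
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` has an intersection of 1 and a union of 7, so the two boxes would not count as a match at the assumed threshold.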

Experimental Setup and Results

Experiments evaluate MLLMs on DocGenome-test, where models such as GPT-4V achieve notable accuracy on document classification and QA tasks. Scaling experiments show that increasing the amount of training data yields significant improvements across tasks, particularly layout detection and document transformation.

Figure 3: Visualization examples of the seven evaluation tasks in DocGenome-test, showcasing the diverse multi-modal capabilities involved.
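
Document transformation outputs, such as LaTeX generated from an image of an equation or table, are typically scored by comparing the predicted string with a reference string. One widely used string metric is edit distance. The following sketch shows a standard Levenshtein distance and a length-normalized variant; treating this as the scoring rule is an assumption for illustration, and the function names are hypothetical.

```python
def edit_distance(pred, ref):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of pred and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, ch_pred in enumerate(pred, start=1):
        curr = [i]
        for j, ch_ref in enumerate(ref, start=1):
            curr.append(
                min(
                    prev[j] + 1,                         # delete ch_pred
                    curr[j - 1] + 1,                     # insert ch_ref
                    prev[j - 1] + (ch_pred != ch_ref),   # substitute or keep
                )
            )
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer string's length; 0.0 means identical."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0
```

Normalizing by length keeps scores comparable between short equations and long tables, since a single wrong character weighs less in a longer reference.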

Further Opportunities and Generalization

The paper discusses potential applications of DocGenome in refining document-level tasks and improving the understanding of entity relationships in scientific corpora. It also evaluates out-of-distribution (OOD) scenarios, where the results indicate promising generalization. Future directions include extending document transformation techniques and making fuller use of the entity relations.

Conclusion

DocGenome is a substantial addition to scientific document datasets, supporting AI research in multi-modal document understanding. Its comprehensive structure, together with its automated annotation pipeline, makes it a useful resource for researchers exploring the limits of MLLM capabilities. Its design and implementation aim to foster innovation and improved understanding across varied scientific fields.