- The paper introduces PP-DocBee, a multimodal large language model based on the ViT+MLP+LLM paradigm for enhanced document image understanding.
- A novel data synthesis strategy is proposed to create PPInfinityDocData, a high-quality dataset specifically targeting Chinese document understanding, covering text, table, and chart document types.
- Experimental results show PP-DocBee achieves state-of-the-art performance on English benchmarks (e.g., 83.5 on OCRBench with post-processing) and outperforms existing models on Chinese document understanding tasks.
The paper introduces PP-DocBee, a multimodal LLM designed for document image understanding. The model follows the "ViT+MLP+LLM" paradigm: a Vision Transformer (ViT) encodes the image, an MLP projects the visual features into the language space, and an LLM handles text understanding and generation.
Here's a breakdown:
- Data Synthesis Strategy: The work addresses limitations of existing open-source document datasets, such as the scarcity of Chinese corpora and uneven data quality, by introducing a document data synthesis strategy. This strategy produces PPInfinityDocData, a 477k-sample, high-quality dataset for Chinese document understanding.
- PPInfinityDocData: The dataset covers text-rich documents, tables, and charts, with a focus on improving Chinese semantic understanding and broadening multimodal scenario coverage. It comprises 288k text-rich document samples, 26k table samples, and 163k chart samples.
- Data Generation Pipeline: The pipeline targets three document types (text-rich documents, tables, and charts) through a multimodal collaborative generation mechanism: a cascaded processing architecture that combines a small Optical Character Recognition (OCR) model, an LLM, and a rendering engine based on semantic control.
- Text-Rich Document QA Generation: The paper uses an OCR-LLM collaborative verification mechanism, employing PaddleOCR to extract layout structures and textual information. This output is combined with the semantic understanding capabilities of an LLM (ERNIE-Bot 4.0) to correct OCR recognition errors and to control the distribution of generated question-answer pairs.
- Chart Image Generation: LLMs semantically modify the chart parameters in the plotting code and translate the on-image text into Chinese, while keeping all other parameters unchanged.
- Table QA Generation: The approach establishes an HTML-Table dual-modal alignment mechanism, using the original HTML table structure as ground-truth reference information to design hierarchical prompt templates.
- Model Overview: The model builds upon Qwen2-VL-2B, retaining the "ViT+MLP+LLM" architecture.
- Data Pre-processing: The model employs a ViT-style patch-based approach, dividing each image into many small patches. During training, the upper limit of the resize threshold is raised from 512 to 768 pixels.
- Dynamic Ratio Sampling Training: A dynamic data-ratio sampling mechanism optimizes training by assigning different sampling weights to data of different types and sources, balancing the size differences between datasets.
- OCR Post-Processing: During inference, OCR-recognized text is injected into the model's input by appending it as auxiliary context to the original question.
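The HTML-Table alignment idea above can be sketched as follows: the original HTML table serves as ground-truth structure, and its cells are flattened into a prompt for QA generation. This is a minimal illustration, not the paper's actual templates; the prompt wording and `table_qa_prompt` helper are assumptions.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text from a simple <table> into a list of rows (illustrative)."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def table_qa_prompt(html):
    """Build a QA-generation prompt grounded in the original HTML table structure.

    The prompt text is a hypothetical stand-in for the paper's hierarchical templates.
    """
    parser = TableExtractor()
    parser.feed(html)
    grid = "\n".join(" | ".join(row) for row in parser.rows)
    return (
        "Given this table (cells delimited by '|'):\n"
        f"{grid}\n"
        "Generate question-answer pairs grounded strictly in the cells above."
    )
```

Keeping the HTML as the authoritative source lets generated answers be checked against actual cell values rather than OCR output.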
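The patch-based pre-processing step can be illustrated with a small sizing helper: scale the image so its longer side stays within the resize threshold (768 pixels during training), then snap each dimension to a multiple of the patch size. The patch size of 28 and the rounding scheme are assumptions for illustration, not the model's exact implementation.

```python
def resize_within_limit(width, height, max_side=768, patch=28):
    """Compute a target size whose longer side is <= max_side and whose
    dimensions are multiples of the patch size (patch=28 is a hypothetical value)."""
    scale = min(1.0, max_side / max(width, height))
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)
    return new_w, new_h
```

Raising `max_side` from 512 to 768 preserves more fine-grained text detail in dense document images at the cost of more visual tokens.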
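Dynamic ratio sampling can be sketched as a two-stage draw: pick a dataset in proportion to its weight, then pick a sample uniformly from it. This is a minimal sketch of the general technique under assumed inputs; the paper's actual weights and scheduling are not specified here.

```python
import random


def ratio_sampler(datasets, weights, seed=0):
    """Yield (dataset_name, sample) pairs, drawing datasets in proportion
    to their sampling weights so small datasets are not drowned out.

    datasets: dict mapping name -> list of samples
    weights:  dict mapping name -> relative sampling weight (assumed scheme)
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])
```

Decoupling the dataset-level ratio from dataset size is what balances, say, a 477k synthetic corpus against much smaller open-source sets.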
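The OCR post-processing step amounts to prompt augmentation at inference time. A minimal sketch, assuming the OCR lines are already available; the exact prompt wording used by PP-DocBee is not given in the summary, so the template below is illustrative.

```python
def with_ocr_context(question, ocr_lines):
    """Prepend OCR-recognized text to the user question as auxiliary context.

    The instruction wording is a hypothetical stand-in for the model's prompt.
    """
    ocr_block = "\n".join(ocr_lines)
    return (
        "The following text was recognized from the image by an OCR model and "
        "may contain errors; use it only as a reference.\n"
        f"{ocr_block}\n\n"
        f"Question: {question}"
    )
```

Because the OCR text is marked as fallible reference material, the model can cross-check it against the image rather than copy it blindly.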
The experimental results show that PP-DocBee achieves state-of-the-art performance on English document understanding benchmarks and outperforms existing open-source and commercial models on Chinese document understanding. Ablation studies validate the effectiveness of the data synthesis strategy and the dynamic ratio sampling. PP-DocBee scores 81.2 on TextVQA and 82.8 on OCRBench, improving to 83.5 with OCR post-processing. On internal Chinese benchmarks, PP-DocBee-2B leads the "Printed text" category with a score of 517 and reaches an overall score of 765, demonstrating strong comprehensive accuracy.