- The paper introduces PP-DocBee, a multimodal large language model based on the ViT+MLP+LLM paradigm for enhanced document image understanding.
- A novel data synthesis strategy is proposed to create PPInfinityDocData, a high-quality dataset specifically targeting Chinese document understanding, covering text, table, and chart document types.
- Experimental results show PP-DocBee achieves state-of-the-art performance on English benchmarks (e.g., 83.5 on OCRBench with post-processing) and outperforms existing models on Chinese document understanding tasks.
The paper introduces PP-DocBee, a multimodal LLM designed for document image understanding. The model follows the "ViT+MLP+LLM" paradigm: a Vision Transformer (ViT) encodes the image, an MLP projects the visual features into the language space, and an LLM handles text understanding and generation.
Here's a breakdown:
- Data Synthesis Strategy: The work addresses limitations of existing open-source document datasets, such as the scarcity of Chinese corpora and uneven data quality, by introducing a document data synthesis strategy. This strategy produces PPInfinityDocData, a 477k-sample, high-quality dataset for Chinese document understanding.
- PPInfinityDocData: The dataset covers text-rich documents, tables, and charts, with a focus on improving Chinese semantic understanding and broadening multimodal scenario coverage. It comprises 288k text-rich document samples, 26k table samples, and 163k chart samples.
- Data Generation Pipeline: The pipeline targets three document types (text-rich documents, tables, and charts) through a multimodal collaborative generation mechanism: a cascaded processing architecture that combines a small Optical Character Recognition (OCR) model, an LLM, and a rendering engine based on semantic control.
- Text-Rich Document QA Generation: The paper uses an OCR-LLM collaborative verification mechanism, employing PaddleOCR to extract layout structures and textual information. This output is combined with the semantic understanding capabilities of an LLM (ERNIE-Bot 4.0) to correct OCR recognition errors and to control the distribution of generated question-answer pairs.
- Chart Image Generation: LLMs semantically modify the chart parameters in the plotting code and translate the on-image text into Chinese, while keeping all other parameters unchanged.
- Table QA Generation: The approach establishes an HTML-Table dual-modal alignment mechanism, using the original HTML table structure as ground-truth reference information to design hierarchical prompt templates.
- Model Overview: The model builds upon Qwen2-VL-2B, retaining the "ViT+MLP+LLM" architecture.
- Data Pre-processing: The model employs a ViT-style patch-based approach, dividing each image into many small patches. During training, the upper limit of the resize threshold is raised from 512 to 768 pixels.
- Dynamic Ratio Sampling Training: A dynamic data-ratio sampling mechanism optimizes training by assigning different sampling weights to data of different types and sources, balancing the size differences between datasets.
- OCR Post-Processing: During inference, OCR-recognized text is injected into the model's input by appending it as auxiliary context to the original question.
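The HTML-Table alignment idea above can be sketched as follows: the original HTML table serves as ground-truth structure, and its cells are flattened into a prompt for QA generation. This is a minimal illustration, not the paper's actual templates; the prompt wording and `table_qa_prompt` helper are assumptions.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text from a simple <table> into a list of rows (illustrative)."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def table_qa_prompt(html):
    """Build a QA-generation prompt grounded in the original HTML table structure.

    The prompt text is a hypothetical stand-in for the paper's hierarchical templates.
    """
    parser = TableExtractor()
    parser.feed(html)
    grid = "\n".join(" | ".join(row) for row in parser.rows)
    return (
        "Given this table (cells delimited by '|'):\n"
        f"{grid}\n"
        "Generate question-answer pairs grounded strictly in the cells above."
    )
```

Keeping the HTML as the authoritative source lets generated answers be checked against actual cell values rather than OCR output.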
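The patch-based pre-processing step can be illustrated with a small sizing helper: scale the image so its longer side stays within the resize threshold (768 pixels during training), then snap each dimension to a multiple of the patch size. The patch size of 28 and the rounding scheme are assumptions for illustration, not the model's exact implementation.

```python
def resize_within_limit(width, height, max_side=768, patch=28):
    """Compute a target size whose longer side is <= max_side and whose
    dimensions are multiples of the patch size (patch=28 is a hypothetical value)."""
    scale = min(1.0, max_side / max(width, height))
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)
    return new_w, new_h
```

Raising `max_side` from 512 to 768 preserves more fine-grained text detail in dense document images at the cost of more visual tokens.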
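Dynamic ratio sampling can be sketched as a two-stage draw: pick a dataset in proportion to its weight, then pick a sample uniformly from it. This is a minimal sketch of the general technique under assumed inputs; the paper's actual weights and scheduling are not specified here.

```python
import random


def ratio_sampler(datasets, weights, seed=0):
    """Yield (dataset_name, sample) pairs, drawing datasets in proportion
    to their sampling weights so small datasets are not drowned out.

    datasets: dict mapping name -> list of samples
    weights:  dict mapping name -> relative sampling weight (assumed scheme)
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])
```

Decoupling the dataset-level ratio from dataset size is what balances, say, a 477k synthetic corpus against much smaller open-source sets.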
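The OCR post-processing step amounts to prompt augmentation at inference time. A minimal sketch, assuming the OCR lines are already available; the exact prompt wording used by PP-DocBee is not given in the summary, so the template below is illustrative.

```python
def with_ocr_context(question, ocr_lines):
    """Prepend OCR-recognized text to the user question as auxiliary context.

    The instruction wording is a hypothetical stand-in for the model's prompt.
    """
    ocr_block = "\n".join(ocr_lines)
    return (
        "The following text was recognized from the image by an OCR model and "
        "may contain errors; use it only as a reference.\n"
        f"{ocr_block}\n\n"
        f"Question: {question}"
    )
```

Because the OCR text is marked as fallible reference material, the model can cross-check it against the image rather than copy it blindly.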
The experimental results show that PP-DocBee achieves state-of-the-art performance on English document understanding benchmarks and outperforms existing open-source and commercial models on Chinese document understanding. Ablation studies validate the effectiveness of the data synthesis strategy and the dynamic ratio sampling. PP-DocBee scores 81.2 on TextVQA and 82.8 on OCRBench, improving to 83.5 with OCR post-processing. On internal Chinese benchmarks, PP-DocBee-2B leads the "Printed text" category with a score of 517 and reaches an overall score of 765, demonstrating strong comprehensive accuracy.