- The paper introduces a novel data filtering strategy for multimodal document understanding that uses a large pre-trained model to evaluate sample quality and discards outliers.
- The paper demonstrates that fusing visual features from middle layers of the Vision Transformer, rather than just the final layer, significantly enhances document understanding performance.
- The paper presents inference optimizations that substantially reduce inference time and end-to-end latency for PP-DocBee2 while maintaining high accuracy, enabling efficient real-world deployment.
PP-DocBee2: Enhanced Baselines and Data Efficiency for Multimodal Document Understanding
PP-DocBee2 presents a series of architectural and data-centric advancements for multimodal document understanding, with a focus on Chinese business documents. The model builds upon the Qwen2.5-VL-3B backbone and introduces innovations in data quality optimization, visual feature fusion, and inference efficiency. The reported results demonstrate substantial improvements in both accuracy and latency over previous baselines.
Data Quality Optimization
A central contribution of PP-DocBee2 is its data filtering strategy, which leverages a large-scale multimodal pre-trained model (Qwen2.5-VL-7B) as a data evaluator. For each training sample, the evaluator computes the forward cross-entropy loss, which serves as a proxy for the difficulty of semantic alignment across modalities. The loss distribution is modeled as approximately normal, and samples whose loss exceeds μ + 2σ (where μ and σ are the mean and standard deviation of the losses) are discarded. This approach systematically removes outliers—typically noisy or overly difficult samples—resulting in a cleaner, more learnable dataset.
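The statistical filtering step can be sketched as follows. This assumes the per-sample forward cross-entropy losses have already been computed by the evaluator model (the evaluation code itself is not shown in the paper); the function and variable names are illustrative, not from the paper's implementation.

```python
import numpy as np

def filter_by_loss(losses, num_sigma=2.0):
    """Keep samples whose evaluator loss is within mu + num_sigma * sigma.

    `losses` holds the per-sample forward cross-entropy computed by the
    evaluator model. With num_sigma=2.0 this matches the paper's 2-sigma
    filtering criterion; other thresholds from the ablation can be tried
    by changing num_sigma.
    """
    losses = np.asarray(losses, dtype=np.float64)
    mu, sigma = losses.mean(), losses.std()
    return losses <= mu + num_sigma * sigma  # boolean keep-mask

# Example: six typical samples plus one high-loss outlier.
losses = [1.1, 0.9, 1.0, 1.2, 1.05, 0.95, 8.0]
keep_mask = filter_by_loss(losses)
kept = [l for l, keep in zip(losses, keep_mask) if keep]  # outlier 8.0 removed
```

Because the threshold is relative to the loss distribution itself, the same criterion adapts automatically as the dataset or evaluator changes, which is what makes the approach model-agnostic.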
Ablation studies confirm the efficacy of this filtering: the 2σ strategy yields the highest overall benchmark score (845), outperforming both the unfiltered baseline and other σ-based thresholds. This demonstrates that careful data curation, guided by statistical criteria and strong evaluators, can significantly enhance model generalization and stability.
Visual Feature Fusion Strategy
PP-DocBee2 revisits the standard practice of using only the final output of the Vision Transformer (ViT) as the visual representation. Motivated by recent findings on the underutilization of intermediate features, the model decomposes the ViT into shallow, middle, and deep layers. It then fuses token embeddings from selected middle layers (notably layer 16) with the final layer output before projection into the LLM input space.
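The fusion step can be sketched as below. The paper specifies that layer-16 token embeddings are fused with the final-layer output before projection, but the exact fusion operator is not detailed here; this sketch assumes a simple element-wise average as one plausible choice, and the function name is illustrative.

```python
import numpy as np

def fuse_vit_features(hidden_states, mid_layer=16):
    """Fuse a middle ViT layer with the final layer before the projector.

    `hidden_states` is a list of per-layer token embeddings, each of shape
    (num_tokens, dim), as a ViT returns when asked for all hidden states.
    Layer 16 is the paper's best-performing choice; the element-wise
    average used here is an assumed fusion operator (alternatives include
    concatenation followed by a learned linear projection).
    """
    mid = hidden_states[mid_layer]
    final = hidden_states[-1]
    return 0.5 * (mid + final)

# Toy example: hidden states for a 24-layer ViT, 4 tokens of dimension 8
# (index 0 is the input embedding, indices 1..24 are layer outputs).
rng = np.random.default_rng(0)
states = [rng.standard_normal((4, 8)) for _ in range(25)]
fused = fuse_vit_features(states)  # shape (4, 8), fed to the projector
```

The fused tokens then pass through the usual projection into the LLM input space, so the change is local to the vision side and leaves the language model untouched.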
Empirical results indicate that selecting features from layer 16 yields the best overall performance (score of 852), surpassing both deep-layer-only and multi-layer averaging strategies. The ablation further reveals that the optimal layer selection is task-dependent: middle layers are more effective for printed text, while deeper layers benefit table understanding. This suggests that fine-grained, multi-scale visual cues from intermediate layers are critical for complex document reasoning tasks.
Inference Optimization
PP-DocBee2 addresses the high inference latency typical of multimodal large language models (MLLMs) by introducing several optimizations to the serving pipeline.
The efficient version of PP-DocBee2 achieves a 73.0% reduction in inference time and a 48.6% decrease in end-to-end latency, with negligible impact on accuracy. This is particularly relevant for real-world deployment scenarios where throughput and responsiveness are critical.
Experimental Results
On internal Chinese business document benchmarks, PP-DocBee2-3B achieves the highest overall score (852), outperforming GPT-4o (685), Qwen2.5-VL-3B (789), and the original PP-DocBee-2B (765). The model demonstrates strong performance across printed text, tables, and charts, though performance on seals remains limited due to the small sample size.
The ablation tables provide clear evidence for the impact of both data filtering and layer selection strategies. Notably, the 2σ data filtering and layer 16 feature fusion consistently yield the best results across categories.
Implications and Future Directions
PP-DocBee2's results underscore the importance of data quality and architectural choices in multimodal document understanding. The use of strong foundation models as data evaluators introduces a scalable, model-agnostic approach to dataset curation, which could be extended to other domains and languages. The findings on intermediate feature fusion suggest that future multimodal models should more systematically explore the representational hierarchy of visual encoders, potentially via dynamic or task-adaptive fusion mechanisms.
From a deployment perspective, the demonstrated latency reductions make PP-DocBee2 suitable for enterprise-scale document processing pipelines, especially in high-throughput or interactive settings.
Potential future developments include:
- Dynamic layer selection: Adapting the fusion strategy based on input characteristics or downstream task requirements.
- Cross-lingual generalization: Extending the data filtering and fusion techniques to multilingual or cross-domain document understanding.
- End-to-end differentiable data selection: Integrating the data evaluator into the training loop for joint optimization.
Conclusion
PP-DocBee2 advances the state of multimodal document understanding through principled data filtering, effective visual feature fusion, and practical inference optimizations. The model sets a new baseline for Chinese business document comprehension and provides a blueprint for future research on data-efficient, high-performance multimodal systems. The open-source release of code and models further facilitates adoption and benchmarking in both academic and industrial contexts.