- The paper introduces a novel data filtering strategy for multimodal document understanding that uses a large pre-trained model to evaluate sample quality and discards outliers.
- The paper demonstrates that fusing visual features from middle layers of the Vision Transformer, rather than just the final layer, significantly enhances document understanding performance.
- The paper presents inference optimizations that substantially reduce inference time and end-to-end latency for PP-DocBee2 while maintaining high accuracy, enabling efficient real-world deployment.
PP-DocBee2: Enhanced Baselines and Data Efficiency for Multimodal Document Understanding
PP-DocBee2 presents a series of architectural and data-centric advancements for multimodal document understanding, with a focus on Chinese business documents. The model builds upon the Qwen2.5-VL-3B backbone and introduces innovations in data quality optimization, visual feature fusion, and inference efficiency. The reported results demonstrate substantial improvements in both accuracy and latency over previous baselines.
Data Quality Optimization
A central contribution of PP-DocBee2 is its data filtering strategy, which leverages a large-scale multimodal pre-trained model (Qwen2.5-VL-7B) as a data evaluator. For each training sample, the evaluator computes the forward cross-entropy loss, which serves as a proxy for the difficulty of semantic alignment across modalities. The loss distribution is modeled as approximately normal, and samples whose loss exceeds μ + 2σ (where μ and σ are the mean and standard deviation of the losses) are discarded. This approach systematically removes outliers—typically noisy or overly difficult samples—resulting in a cleaner, more learnable dataset.
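The statistical filtering step can be sketched as follows. This assumes the per-sample forward cross-entropy losses have already been computed by the evaluator model (the evaluation code itself is not shown in the paper); the function and variable names are illustrative, not from the paper's implementation.

```python
import numpy as np

def filter_by_loss(losses, num_sigma=2.0):
    """Keep samples whose evaluator loss is within mu + num_sigma * sigma.

    `losses` holds the per-sample forward cross-entropy computed by the
    evaluator model. With num_sigma=2.0 this matches the paper's 2-sigma
    filtering criterion; other thresholds from the ablation can be tried
    by changing num_sigma.
    """
    losses = np.asarray(losses, dtype=np.float64)
    mu, sigma = losses.mean(), losses.std()
    return losses <= mu + num_sigma * sigma  # boolean keep-mask

# Example: six typical samples plus one high-loss outlier.
losses = [1.1, 0.9, 1.0, 1.2, 1.05, 0.95, 8.0]
keep_mask = filter_by_loss(losses)
kept = [l for l, keep in zip(losses, keep_mask) if keep]  # outlier 8.0 removed
```

Because the threshold is relative to the loss distribution itself, the same criterion adapts automatically as the dataset or evaluator changes, which is what makes the approach model-agnostic.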
Ablation studies confirm the efficacy of this filtering: the 2σ strategy yields the highest overall benchmark score (845), outperforming both the unfiltered baseline and other σ-based thresholds. This demonstrates that careful data curation, guided by statistical criteria and strong evaluators, can significantly enhance model generalization and stability.
Visual Feature Fusion Strategy
PP-DocBee2 revisits the standard practice of using only the final output of the Vision Transformer (ViT) as the visual representation. Motivated by recent findings on the underutilization of intermediate features, the model decomposes the ViT into shallow, middle, and deep layers. It then fuses token embeddings from selected middle layers (notably layer 16) with the final layer output before projection into the LLM input space.
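The fusion step can be sketched as below. The paper specifies that layer-16 token embeddings are fused with the final-layer output before projection, but the exact fusion operator is not detailed here; this sketch assumes a simple element-wise average as one plausible choice, and the function name is illustrative.

```python
import numpy as np

def fuse_vit_features(hidden_states, mid_layer=16):
    """Fuse a middle ViT layer with the final layer before the projector.

    `hidden_states` is a list of per-layer token embeddings, each of shape
    (num_tokens, dim), as a ViT returns when asked for all hidden states.
    Layer 16 is the paper's best-performing choice; the element-wise
    average used here is an assumed fusion operator (alternatives include
    concatenation followed by a learned linear projection).
    """
    mid = hidden_states[mid_layer]
    final = hidden_states[-1]
    return 0.5 * (mid + final)

# Toy example: hidden states for a 24-layer ViT, 4 tokens of dimension 8
# (index 0 is the input embedding, indices 1..24 are layer outputs).
rng = np.random.default_rng(0)
states = [rng.standard_normal((4, 8)) for _ in range(25)]
fused = fuse_vit_features(states)  # shape (4, 8), fed to the projector
```

The fused tokens then pass through the usual projection into the LLM input space, so the change is local to the vision side and leaves the language model untouched.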
Empirical results indicate that selecting features from layer 16 yields the best overall performance (score of 852), surpassing both deep-layer-only and multi-layer averaging strategies. The ablation further reveals that the optimal layer selection is task-dependent: middle layers are more effective for printed text, while deeper layers benefit table understanding. This suggests that fine-grained, multi-scale visual cues from intermediate layers are critical for complex document reasoning tasks.
Inference Optimization
PP-DocBee2 addresses the high inference latency typical of multimodal large language models (MLLMs) by introducing several optimizations to the serving pipeline.
The efficient version of PP-DocBee2 achieves a 73.0% reduction in inference time and a 48.6% decrease in end-to-end latency, with negligible impact on accuracy. This is particularly relevant for real-world deployment scenarios where throughput and responsiveness are critical.
Experimental Results
On internal Chinese business document benchmarks, PP-DocBee2-3B achieves the highest overall score (852), outperforming GPT-4o (685), Qwen2.5-VL-3B (789), and the original PP-DocBee-2B (765). The model demonstrates strong performance across printed text, tables, and charts, though performance on seals remains limited due to the small sample size.
The ablation tables provide clear evidence for the impact of both data filtering and layer selection strategies. Notably, the 2σ data filtering and layer 16 feature fusion consistently yield the best results across categories.
Implications and Future Directions
PP-DocBee2's results underscore the importance of data quality and architectural choices in multimodal document understanding. The use of strong foundation models as data evaluators introduces a scalable, model-agnostic approach to dataset curation, which could be extended to other domains and languages. The findings on intermediate feature fusion suggest that future multimodal models should more systematically explore the representational hierarchy of visual encoders, potentially via dynamic or task-adaptive fusion mechanisms.
From a deployment perspective, the demonstrated latency reductions make PP-DocBee2 suitable for enterprise-scale document processing pipelines, especially in high-throughput or interactive settings.
Potential future developments include:
- Dynamic layer selection: Adapting the fusion strategy based on input characteristics or downstream task requirements.
- Cross-lingual generalization: Extending the data filtering and fusion techniques to multilingual or cross-domain document understanding.
- End-to-end differentiable data selection: Integrating the data evaluator into the training loop for joint optimization.
Conclusion
PP-DocBee2 advances the state of multimodal document understanding through principled data filtering, effective visual feature fusion, and practical inference optimizations. The model sets a new baseline for Chinese business document comprehension and provides a blueprint for future research on data-efficient, high-performance multimodal systems. The open-source release of code and models further facilitates adoption and benchmarking in both academic and industrial contexts.