
Qwen2.5-VL Technical Report

Published 19 Feb 2025 in cs.CV and cs.CL (arXiv:2502.13923v1)

Abstract: We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

Summary

  • The paper presents a multimodal architecture that integrates a refined Vision Transformer, with native dynamic resolution processing, with an LLM decoder.
  • The model leverages multimodal rotary position embedding aligned to absolute time and a training corpus expanded from 1.2 trillion to 4.1 trillion tokens.
  • The model demonstrates strong performance in visual recognition, object localization, and long-video comprehension through window attention and optimized inference.

The Qwen2.5-VL paper offers a comprehensive account of the model's architecture, advancements, and applications. The model demonstrates significant progress in multimodal understanding, especially in tasks that demand tight integration of vision and language, such as visual recognition, object localization, and long-video comprehension.

Model Architecture

Qwen2.5-VL combines a vision encoder with an LLM decoder to process images and videos efficiently. The architecture handles multimodal inputs dynamically, accommodating variation in both spatial resolution and temporal dynamics.
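As a rough sketch of this dataflow, the snippet below wires a vision encoder, a projection layer, and an LLM decoder together. All module and argument names are hypothetical stand-ins for illustration, not the actual Qwen2.5-VL API.

```python
import torch
import torch.nn as nn

# Hypothetical composition of the pipeline described above; the concrete
# encoder, projector, and decoder modules are left abstract on purpose.
class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # ViT over native-resolution patches
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # autoregressive language-model decoder

    def forward(self, pixel_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode image/video patches, project them to token embeddings, and
        # decode jointly with the text tokens.
        visual_tokens = self.projector(self.vision_encoder(pixel_patches))
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```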

Vision Encoder

The vision encoder employs a refined Vision Transformer (ViT) with enhancements such as window-based attention and native dynamic resolution processing. These features let the model handle high-resolution inputs efficiently by splitting images into patches and dynamically adjusting frame sampling rates for video inputs.

Figure 1: Diagram illustrating the integration of the vision encoder and LLM decoder within the Qwen2.5-VL framework.
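To make the patch-splitting step concrete, here is a minimal sketch of cutting a native-resolution image into flattened ViT patches. The 14-pixel patch size is an assumed, typical ViT value, not one confirmed by this summary.

```python
import torch

def patchify(image: torch.Tensor, patch_size: int = 14) -> torch.Tensor:
    """Split a (C, H, W) image into a sequence of flattened patches.

    Assumes H and W are already multiples of patch_size (see the
    dynamic-resolution resize sketch later in this article).
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/p, W/p, p, p) -> (num_patches, C * p * p), row-major over the grid
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
```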

Multimodal Rotary Position Embedding

Qwen2.5-VL leverages Multimodal Rotary Position Embedding (MRoPE) with its temporal component aligned to absolute time, strengthening the model's grasp of temporal sequences. This enables precise, second-level event localization in videos of extended duration.
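The key idea, absolute-time-aligned temporal position IDs, fits in a few lines. The `ticks_per_second` granularity below is an illustrative assumption, not a value from the report.

```python
# Sketch: frames sampled at different FPS from the same video receive
# temporal position IDs proportional to their absolute timestamps, so the
# gap between IDs encodes elapsed real time rather than frame index.
def temporal_position_ids(frame_timestamps_s: list[float],
                          ticks_per_second: float = 2.0) -> list[int]:
    return [round(t * ticks_per_second) for t in frame_timestamps_s]

# Frames one second apart are two ticks apart regardless of sampling rate:
assert temporal_position_ids([0.0, 1.0, 2.5]) == [0, 2, 5]
```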

Innovations and Features

Dynamic Resolution and FPS Sampling

Native dynamic resolution processing allows Qwen2.5-VL to consume inputs at their original sizes rather than warping them to a fixed resolution, and dynamic FPS sampling extends the same idea to video. Together these let the model perceive spatial scales and temporal dynamics directly, without traditional normalization techniques.
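A minimal sketch of such preprocessing, assuming an illustrative patch size and patch budget (the report's actual constants may differ): round each image side to a patch-grid multiple, and scale down only if the total patch count exceeds the budget.

```python
import math

PATCH_SIZE = 14      # assumed ViT patch size
MAX_PATCHES = 16384  # illustrative cap on visual tokens per image

def dynamic_resize(height: int, width: int) -> tuple[int, int]:
    """Return an (h, w) target size aligned to the patch grid, preserving aspect ratio."""
    h = max(PATCH_SIZE, round(height / PATCH_SIZE) * PATCH_SIZE)
    w = max(PATCH_SIZE, round(width / PATCH_SIZE) * PATCH_SIZE)
    # Scale both sides down uniformly if the patch budget is exceeded.
    if (h // PATCH_SIZE) * (w // PATCH_SIZE) > MAX_PATCHES:
        scale = math.sqrt(MAX_PATCHES / ((h / PATCH_SIZE) * (w / PATCH_SIZE)))
        h = max(PATCH_SIZE, math.floor(h * scale / PATCH_SIZE) * PATCH_SIZE)
        w = max(PATCH_SIZE, math.floor(w * scale / PATCH_SIZE) * PATCH_SIZE)
    return h, w
```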

Enhanced Data Corpus

Qwen2.5-VL is trained on a curated corpus expanded from 1.2 trillion to 4.1 trillion tokens. The corpus spans diverse multimodal data types, supporting robust training for tasks such as document parsing, video understanding, and interactive agent applications.

Performance Enhancements

By employing window attention in the vision encoder and optimizing inference, Qwen2.5-VL reduces computational overhead while achieving strong performance across a range of multimodal tasks. The flagship model matches state-of-the-art systems such as GPT-4o and Claude 3.5 Sonnet, and is particularly strong in document and diagram understanding.
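Window attention's saving comes from restricting self-attention to fixed-size local windows, so compute scales linearly with image area instead of quadratically. Below is a simplified single-head sketch; the 8-patch window is an assumption, and padding, multi-head projection, and any interleaved full-attention layers are omitted.

```python
import torch
import torch.nn.functional as F

def window_attention(x: torch.Tensor, grid_h: int, grid_w: int,
                     window: int = 8) -> torch.Tensor:
    """x: (grid_h * grid_w, dim) patch features, row-major over the grid.

    Assumes grid_h and grid_w are divisible by `window`.
    """
    n, d = x.shape
    # Partition the patch grid into non-overlapping window x window blocks.
    x = x.view(grid_h // window, window, grid_w // window, window, d)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, d)
    # Plain self-attention restricted to each window (q = k = v = x).
    out = F.scaled_dot_product_attention(x, x, x)
    # Undo the window partition to restore the original token order.
    out = out.reshape(grid_h // window, grid_w // window, window, window, d)
    return out.permute(0, 2, 1, 3, 4).reshape(n, d)
```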

Practical Implications

These advancements make Qwen2.5-VL a versatile model for applications ranging from edge AI to high-performance computing. Available in 3B, 7B, and 72B parameter sizes, it caters to diverse computational budgets while maintaining robust linguistic performance.

Conclusion

Qwen2.5-VL marks a substantial step forward in vision-language modeling, pushing the boundaries of multimodal integration and understanding. It sets strong benchmarks in AI-driven perception and interaction, paving the way for more sophisticated applications in real-world environments. The technical report is a valuable reference for researchers and practitioners working on multimodal AI.
