Kimi-VL Technical Report

Published 10 Apr 2025 in cs.CV | (2504.07491v3)

Abstract: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Summary

The paper presents an efficient multimodal VLM that excels in OCR, mathematical reasoning, and multi-image tasks while activating only 2.8B parameters in its MoE language decoder.
The paper details a multi-stage training paradigm that combines standalone ViT training with joint image-text pre-training phases, consuming a total of 4.4 trillion tokens.
The paper extends the context window to 128K tokens and uses data, expert, pipeline, and context parallelism to raise training throughput.

Abstract and Introduction

The Kimi-VL report introduces an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) capable of advanced multimodal reasoning across diverse domains. The model activates only 2.8 billion parameters in its language decoder, yet handles a variety of challenging multimodal tasks, including OCR, mathematical reasoning, and multi-image understanding. As a general-purpose VLM, it competes with efficient models such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, and the report states that it surpasses GPT-4o in several key domains. Its 128K extended context window additionally lets it process long inputs such as long videos and documents. These properties position Kimi-VL as a promising option for efficient multimodal reasoning.

Model Architecture

The core architecture of Kimi-VL consists of three components: MoonViT, an MLP projector, and an MoE language model. MoonViT is a native-resolution vision encoder. It handles images of varying resolution by packing their patches into a single sequence, a mechanism similar to NaViT, which keeps the encoder compatible with the computational operators used in language modeling. The MLP projector bridges the vision and language modalities by mapping the encoder's visual features into the input space of the language decoder. The MoE language decoder activates only a small subset of its parameters (2.8B) for each token, which keeps compute per token well below that of a dense model of the same total size.
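The patch packing idea can be illustrated with a minimal sketch. This is not MoonViT's implementation: the patch size, the `PackedBatch` container, and the function names are hypothetical, and the code only tracks patch indices rather than pixel data. It shows how images of different resolutions can share one 1D sequence, with cumulative sequence boundaries recording where each image starts and ends.

```python
from dataclasses import dataclass
from typing import List, Tuple

PATCH = 14  # hypothetical patch size in pixels, chosen only for illustration


@dataclass
class PackedBatch:
    patch_ids: List[Tuple[int, int, int]]  # (image index, patch row, patch col)
    cu_seqlens: List[int]                  # cumulative sequence boundaries per image


def pack_images(sizes: List[Tuple[int, int]], patch: int = PATCH) -> PackedBatch:
    """Pack the patches of variable-resolution images into one 1D sequence."""
    patch_ids: List[Tuple[int, int, int]] = []
    cu_seqlens = [0]
    for img_idx, (height, width) in enumerate(sizes):
        rows, cols = height // patch, width // patch  # partial edge patches are dropped
        for r in range(rows):
            for c in range(cols):
                patch_ids.append((img_idx, r, c))
        cu_seqlens.append(len(patch_ids))
    return PackedBatch(patch_ids, cu_seqlens)


def attention_allowed(batch: PackedBatch, i: int, j: int) -> bool:
    """Two packed tokens may attend to each other only if they share an image."""
    return batch.patch_ids[i][0] == batch.patch_ids[j][0]


batch = pack_images([(28, 42), (14, 14)])
print(batch.cu_seqlens)
```

The boundary list plays the same role as the cumulative sequence lengths that variable-length attention kernels take as input, which is why packing lets a vision encoder reuse operators built for language models.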

Figure 1: The model architecture of Kimi-VL, consisting of a MoonViT vision encoder, an MLP projector, and a Mixture-of-Experts language decoder.
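To make the idea of a decoder that activates only part of its parameters concrete, here is a generic top-k expert routing sketch. It is an assumption-laden illustration, not Kimi-VL's router: the scalar "experts", the choice of k, and the renormalization of weights over the selected experts are all hypothetical stand-ins for what a real MoE layer does with feed-forward networks.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Evaluate only the k routed experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy experts; only two are evaluated for this token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
print(moe_forward(1.0, [0.0, 1.0, 1.0, -5.0], experts, k=2))
```

Because only k experts run per token, compute scales with the activated parameters rather than with the total parameter count.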

Training Paradigms

Kimi-VL is trained in multiple stages. Training begins with standalone ViT training, in which the vision encoder learns from image-text pairs under a combination of contrastive and cross-entropy losses. Joint pre-training stages then fuse language comprehension with visual capabilities. The final stage, long-context activation, scales up the context length so that the model can process sequences of up to 128K tokens.
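A combined contrastive and cross-entropy objective of the kind used in the ViT stage can be sketched as follows. The summary does not state the exact contrastive formulation or the weighting, so this sketch assumes a symmetric softmax (InfoNCE-style) contrastive term and a hypothetical `caption_weight` parameter; all function names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_loss(sim: List[List[float]]) -> float:
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    Row i holds image i's similarity to every text; matching pairs sit on the diagonal.
    """
    n = len(sim)
    img_to_txt = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    txt_to_img = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def caption_loss(token_logits: List[List[float]], targets: List[int]) -> float:
    """Mean cross-entropy of the target caption tokens."""
    return -sum(log_softmax(l)[t] for l, t in zip(token_logits, targets)) / len(targets)


def vit_stage_loss(
    sim: List[List[float]],
    token_logits: List[List[float]],
    targets: List[int],
    caption_weight: float = 1.0,  # hypothetical weighting, not taken from the report
) -> float:
    """Contrastive term plus weighted caption cross-entropy term."""
    return contrastive_loss(sim) + caption_weight * caption_loss(token_logits, targets)


aligned = [[5.0, 0.0], [0.0, 5.0]]    # matching pairs score highest
unaligned = [[0.0, 0.0], [0.0, 0.0]]  # encoder cannot tell pairs apart
logits, targets = [[2.0, 0.0], [0.0, 2.0]], [0, 1]
print(vit_stage_loss(aligned, logits, targets) < vit_stage_loss(unaligned, logits, targets))
```

The contrastive term rewards the encoder for ranking each image's own text above the others in the batch, while the cross-entropy term rewards features that support generating the caption token by token.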

Figure 2: The pre-training stages of Kimi-VL consuming a total of 4.4 trillion tokens, emphasizing joint training phases.

Evaluation and Performance

Kimi-VL demonstrates competitive performance across benchmarks such as MMMU, MathVista, and InfoVQA, indicating its effectiveness in multimodal reasoning and understanding. The 128K context window supports long-form inputs: the report cites scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. The long-thinking variant, Kimi-VL-Thinking-2506, further strengthens reasoning through long chain-of-thought supervised fine-tuning and reinforcement learning, reaching 64.0 on MMMU, 56.9 on MathVision, and 80.1 on MathVista.

Figure 3: Highlights of Kimi-VL performance across general, OCR, multi-image, long video, document, and agent benchmarks.

Infrastructure and Optimization

The training infrastructure for Kimi-VL combines four parallelism strategies: data, expert, pipeline, and context parallelism. Together these raise training throughput and make large-scale training practical, including on long sequences. Parameter updates are handled by an enhanced version of the Muon optimizer.
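Of the four strategies, context parallelism is the one most directly tied to the 128K window, since it splits a single long sequence across devices. The sketch below shows only the sharding arithmetic, under stated assumptions: the contiguous split, the function name, and the group size of 8 are hypothetical, and 128K is taken to mean 128 × 1024 tokens.

```python
from typing import List, Tuple


def shard_sequence(seq_len: int, cp_size: int) -> List[Tuple[int, int]]:
    """Split token positions [0, seq_len) into cp_size contiguous [start, end) ranges.

    Any remainder is spread over the first ranks so shard sizes differ by at most one.
    """
    base, extra = divmod(seq_len, cp_size)
    shards: List[Tuple[int, int]] = []
    start = 0
    for rank in range(cp_size):
        length = base + (1 if rank < extra else 0)
        shards.append((start, start + length))
        start += length
    return shards


# One 128K-token sequence split over a hypothetical context-parallel group of 8 ranks.
for rank, (start, end) in enumerate(shard_sequence(128 * 1024, 8)):
    print(rank, start, end)
```

A contiguous split is the simplest choice. With causal attention, later shards attend to more tokens than earlier ones, so practical systems often use a rebalanced assignment instead.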

Conclusion

This report presents Kimi-VL as a highly efficient VLM designed for complex reasoning across a range of domains. Challenges remain in handling domain-specific and long-context tasks, which the report expects to address through further scaling and algorithmic refinement. Kimi-VL is a substantial step forward for efficient multimodal models and sets the stage for future research and applications that require joint understanding of image, text, and video.