InternVL 2.5-1B: Compact Multimodal LLM
- The paper introduces a 1B-scale multimodal model that integrates vision and language via a ViT-MLP-LLM architecture, achieving competitive performance with efficient parameter use.
- It employs a two-stage training pipeline with MLP warmup and full fine-tuning, leveraging dynamic tiling, pixel-unshuffle, and knowledge distillation for robust modality alignment.
- InternVL 2.5-1B sets new open-source benchmarks across multidisciplinary tasks, demonstrating improved reasoning, document analysis, and multimodal comprehension.
InternVL 2.5-1B is an open-source multimodal LLM (MLLM) and the smallest model in the InternVL 2.5 family, integrating vision and language processing within a compact architecture of approximately 0.9–1 billion parameters. The model was released by the OpenGVLab research group, with core design and benchmarking detailed in recent technical reports and papers (Gao et al., 2024; Chen et al., 2024). InternVL 2.5-1B achieves competitive open-source performance on a broad array of multimodal benchmarks, leveraging architectural refinements, data-processing innovations, and advanced training paradigms.
1. Model Architecture
InternVL 2.5-1B adopts the standardized “ViT-MLP-LLM” paradigm, combining a vision encoder, a lightweight projection MLP, and an instruction-tuned LLM. The architectural composition is as follows:
- Vision Encoder: InternViT-300M-448px-V2.5 is used, featuring 24 transformer layers, hidden size 1 024, MLP dimension 4 096, and 16 attention heads. Images are dynamically split into up to 48 tiles of size 448 × 448 px. The encoder applies pixel-unshuffle to reduce each tile from 1 024 patch tokens to 256 visual tokens; local feature extraction is further strengthened through incremental pre-training on a curated image–language data mixture (Chen et al., 2024).
- MLP Projector: A two-layer, randomly initialized MLP projects each tile’s 256 visual tokens (1 024-dimensional) into the LLM’s embedding space. In the initial training stage, only this MLP is trainable, facilitating modality alignment prior to holistic fine-tuning.
- LLM: Qwen 2.5-0.5B-Instruct, a 500 M–parameter, 24-layer transformer with hidden dimension 1 024 and 16 heads, replaces the previous Qwen 2.0-0.5B backbone. This module offers full instruction-tuned generative capacity with improved conversational alignment (Chen et al., 2024).
- Tokenization & Input Organization: Pixel-unshuffled tokens and dynamic tiling enable efficient high-resolution input representation, maintaining up to 40–48 tiles (depending on implementation details) and supporting long-context processing up to 16 384 tokens.
| Module | Parameters | Key Dimensions (layers × hidden × heads) |
|---|---|---|
| InternViT-300M | ~300 M | 24 × 1 024 × 16 |
| MLP Projector | ~50 M | 2 layers, bottleneck ≈ 512 |
| Qwen 2.5-0.5B | ~500 M | 24 × 1 024 × 16 |
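The tiling and pixel-unshuffle step described above can be sketched in isolation. A minimal NumPy illustration, assuming 14 px patches (so a 448 × 448 tile yields a 32 × 32 grid of 1 024 patch features) and an unshuffle factor of 2, consistent with the 1 024 → 256 token reduction; the patch size and factor are assumptions inferred from the stated token counts:

```python
import numpy as np

def pixel_unshuffle(tokens: np.ndarray, factor: int = 2) -> np.ndarray:
    """Space-to-depth on an (H, W, C) grid of patch features: merges each
    factor x factor neighborhood into one token with factor^2 * C channels,
    quartering the token count when factor=2."""
    h, w, c = tokens.shape
    assert h % factor == 0 and w % factor == 0
    t = tokens.reshape(h // factor, factor, w // factor, factor, c)
    t = t.transpose(0, 2, 1, 3, 4)  # group each 2x2 neighborhood together
    return t.reshape(h // factor, w // factor, factor * factor * c)

# A 448 x 448 tile with 14 px patches gives a 32 x 32 grid (1 024 tokens);
# with hidden size 1 024 and factor 2, this leaves 16 x 16 = 256 tokens.
grid = np.random.randn(32, 32, 1024).astype(np.float32)
out = pixel_unshuffle(grid)
print(out.shape)  # (16, 16, 4096)
```

Each of the resulting 256 tokens carries a 4 096-dimensional concatenation of its 2 × 2 neighborhood, which the MLP projector then maps into the LLM embedding space.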
2. Training Methodology and Data
InternVL 2.5-1B training proceeds through a rigorously staged instruction-tuning pipeline, designed to maximize multimodal alignment, robustness, and generalization:
- Stage 1 (MLP Warmup, ~191 B tokens): The vision and language backbones are frozen; only the projector MLP is trained. Sequences of up to 16 384 tokens pack as many as 48 image tiles per sample. JPEG compression and dynamic aspect-ratio selection provide augmentation and regularization, and the next-token-prediction (NTP) loss is reweighted per token (Chen et al., 2024).
- Stage 2 (Full Fine-Tuning, ~176 B tokens): All weights are trainable across 4 epochs on a strictly filtered, high-quality multimodal instruction dataset (16.3 M samples), optimized for conversational diversity and information density. Data filtering employs LLM-based scoring, heuristic repetition detection, and anomaly removal.
- Compression and Distillation: InternVL 2.5-1B achieves nearly 74% of InternVL2-76B’s general benchmark performance using only ≈1.3% of its parameters by leveraging knowledge distillation, pixel-unshuffle, and staged fine-tuning. Cosine-alignment loss guides distillation from larger models (Gao et al., 2024).
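The staged schedule above can be summarized as a toy sketch. The freezing map mirrors the ViT-MLP-LLM layout, and the loss reweighting follows a generic 1/n^alpha sample-averaging form; the module names and the exact alpha are illustrative assumptions, not the paper’s implementation:

```python
def stage_trainable(stage: int) -> dict[str, bool]:
    """Freezing schedule for the two-stage pipeline: stage 1 trains only the
    projector MLP; stage 2 unfreezes all weights for full fine-tuning."""
    if stage not in (1, 2):
        raise ValueError(f"unknown stage {stage}")
    return {
        "vision_encoder": stage == 2,
        "mlp_projector": True,
        "llm": stage == 2,
    }

def reweighted_ntp_loss(token_losses: list[float], alpha: float = 0.5) -> float:
    """Per-sample NTP loss reweighting: the summed token loss is scaled by
    1 / n**alpha. alpha=1 recovers plain token averaging; alpha=0.5 (an
    assumed value) interpolates between token- and sample-level averaging."""
    n = len(token_losses)
    return sum(token_losses) / n ** alpha

print(stage_trainable(1))  # only the projector receives gradients
```

In a real training loop, the boolean map would toggle `requires_grad` on each module’s parameters before the stage begins.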
3. Performance Benchmarks
InternVL 2.5-1B is evaluated extensively on a suite of multidisciplinary, document, chart, real-world, hallucination, multilingual, video, and pure-language tasks, establishing new open-source baselines at the 1 B parameter scale.
- Multidisciplinary Reasoning & Mathematics: MMMU 40.9% (vs. 36.7% for InternVL 2.0-1B; LLaVA-0.5B at 31.4%), MathVista 43.2%, MathVerse 28.0%, OlympiadBench 1.7% (Chen et al., 2024).
- Document/OCR/Chart QA: AI2D mask/no-mask 69.3/77.8%, ChartQA 75.9%, DocVQA 84.8%, TextVQA 72.0%, InfoVQA 56.0%, OCRBench 785/1 000.
- Multi-Image Understanding: BLINK 42.0%, Mantis-Eval 51.2%, MIRB 35.6%, MMT-Bench 50.3%.
- Real-World Comprehension: RealWorldQA 57.5%, MME-RealWorld 44.2%, WildVision win-rate 43.4%, R-Bench 59.0%.
- Comprehensive Multimodal: MME 1 950.5, MMVet 48.8%, MMStar 50.1%.
- Hallucination: HallusionBench 39.0%, MMHal 2.49, POPE F1 89.9%.
- Multilingual: MMMB en/zh/pt/ar/tr/ru 78.8/70.2/61.5/55.0/45.3/61.1.
- Video Understanding: Video-MME 50.3–52.3%, MVBench 64.3%, LongVideoBench 47.9%.
- Pure Language: 17 text-only benchmarks averaging 41.3%, matching the pretrained LLM performance and reversing prior degradation.
Notably, larger closed-source MLLMs (e.g., GPT-4o, Claude-3.5) still outperform InternVL 2.5-1B by a wide margin, but the model narrows the gap between open-source and proprietary systems at this scale (Chen et al., 2024).
4. Scaling Trends and Empirical Laws
The InternVL 2.5 series exhibits near-log-linear empirical scaling between parameter count and OpenCompass score:
- Scaling Effects: InternVL 2.5-1B/2 B/4 B/8 B/26 B obtain OpenCompass scores of 54.5, 59.8, 65.1, 68.1, and 71.3, respectively; swapping Qwen 2.0 to Qwen 2.5 increases scores by ~1–3 points at each scale (Chen et al., 2024).
- Training Efficiency: MLLMs built on the large InternViT encoders reach comparable performance with roughly 10× fewer training tokens than counterparts using smaller vision encoders.
- Compression Ratio: At ~1.3% of the parameter count, Mini-InternVL-1B achieves 74–90% of the performance of full-scale InternVL variants on general benchmarks (Gao et al., 2024).
- Effective Training Data: InternVL 2.5-1B uses ~360 B total training tokens, an order of magnitude less than other competitive models.
A plausible implication is that further innovation in encoder distillation and token efficiency may drive future scaling laws for compact MLLMs.
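The near-log-linear trend can be checked directly against the reported scores. A small NumPy sketch fitting score ≈ a · log(params) + b to the five data points above; this curve fit is purely illustrative and not an analysis from the papers:

```python
import numpy as np

# OpenCompass scores reported for InternVL 2.5 at each parameter scale (B).
params_b = np.array([1.0, 2.0, 4.0, 8.0, 26.0])
scores = np.array([54.5, 59.8, 65.1, 68.1, 71.3])

# Near-log-linear trend: fit score ~ a * log(params) + b.
a, b = np.polyfit(np.log(params_b), scores, deg=1)
pred = a * np.log(params_b) + b
r2 = 1.0 - np.sum((scores - pred) ** 2) / np.sum((scores - scores.mean()) ** 2)
print(f"slope = {a:.2f} points per e-fold of parameters, R^2 = {r2:.3f}")
```

The high coefficient of determination supports the series’ claimed scaling regularity, though five points only loosely constrain the trend.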
5. Unified Adaptation and Fine-Tuning Strategies
InternVL 2.5-1B supports a unified adaptation framework for diverse downstream tasks:
- Task Formatting: All downstream tasks—including visual grounding, region perception, classification, and multi-view/video—are reformulated into VQA-style instruction–response format utilizing tags such as <ref>…</ref> and <box>[x1,y1,x2,y2]</box> (Gao et al., 2024).
- Objective Composition: Composite training loss integrates cross-entropy (VQA), hidden-state cosine alignment (grounding), and box regression.
- Fine-Tuning Protocol: Tasks such as autonomous driving (DriveLM-nuScenes), medical QA, and remote sensing are supported. Full-parameter fine-tuning maximizes accuracy, while freezing the ViT trades some accuracy for substantially lower compute cost.
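The VQA-style reformulation can be sketched for a visual-grounding sample, using the <ref>/<box> tags quoted above; the helper name, prompt wording, and 0–1000 coordinate normalization are assumptions for illustration, not the papers’ exact format:

```python
def format_grounding_sample(obj: str, box_px: tuple[int, int, int, int],
                            width: int, height: int) -> dict[str, str]:
    """Reformulate a grounding annotation as a VQA-style instruction-response
    pair. Pixel coordinates are normalized to a 0-1000 grid (assumed
    convention) so the format is resolution-independent."""
    x1, y1, x2, y2 = box_px
    norm = [round(x1 / width * 1000), round(y1 / height * 1000),
            round(x2 / width * 1000), round(y2 / height * 1000)]
    return {
        "instruction": f"Please provide the bounding box for <ref>{obj}</ref>.",
        "response": f"<ref>{obj}</ref><box>{norm}</box>",
    }

sample = format_grounding_sample("traffic light", (112, 56, 224, 336), 448, 448)
print(sample["response"])  # <ref>traffic light</ref><box>[250, 125, 500, 750]</box>
```

Because every task is cast into this instruction–response shape, a single NTP objective (plus the auxiliary alignment and regression terms) covers grounding, classification, and perception alike.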
6. Test-Time Scaling and Chain-of-Thought Reasoning
InternVL 2.5-1B demonstrates substantive gains on complex reasoning tasks via test-time scaling techniques:
- Chain-of-Thought (CoT) Prompting: Performance on MMMU and other challenging benchmarks is enhanced by appending prompts such as “Let’s think step by step…” to evoke explicit multi-turn reasoning before answer extraction (Chen et al., 2024).
- Majority Voting & Ensembling: Aggregating outputs from multiple prompt samples via majority vote adds another ~1–2% accuracy.
This suggests that inference-side interventions can meaningfully improve compact MLLM performance, compensating for smaller model capacity without retraining.
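The two inference-side techniques admit a minimal sketch: an assumed CoT suffix appended to the prompt, and a simple plurality vote over answers extracted from multiple samples (the answer-extraction step itself is omitted here):

```python
from collections import Counter

# Appended to the prompt to evoke explicit step-by-step reasoning (illustrative).
COT_SUFFIX = "\nLet's think step by step."

def majority_vote(answers: list[str]) -> str:
    """Aggregate final answers from multiple CoT samples; ties resolve to the
    earliest-seen answer, since Counter preserves insertion order."""
    return Counter(answers).most_common(1)[0][0]

# e.g., five sampled runs of the same multiple-choice question:
print(majority_vote(["B", "C", "B", "B", "A"]))  # B
```

In practice each sample would be generated with nonzero temperature so the reasoning paths differ, making the vote informative.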
7. Comparative Analysis, Limitations, and Implications
InternVL 2.5-1B sets a strong baseline for 1 B-parameter multimodal models, markedly improving over previous InternVL releases and comparable compact MLLMs such as DeepSeek-VL-1.3B and MiniCPM-V-3B (Gao et al., 2024). However, performance remains bounded by model scale: larger closed-source systems retain a 15–30 point advantage on flagship benchmarks.
The use of high-fidelity distillation, dynamic resolution tiling, instruction tuning, and test-time reasoning collectively highlight pathways for continued progress in scalable, efficient open-source multimodal AI. A plausible implication is that further advances in architecture, data quality, and adaptive inference could reduce the reliance on large-scale proprietary models for state-of-the-art visual-linguistic reasoning.