Mantis VLA: Vision-Language-Action Framework
- Mantis VLA is a multimodal framework that integrates visual, language, and action information for instruction-grounded robotic control.
- Its Disentangled Visual Foresight (DVF) module decouples next-frame prediction from reasoning, enabling rapid convergence and improved sample efficiency.
- The design leverages a diffusion transformer and meta queries, enhancing prediction accuracy and computational efficiency while maintaining robust language grounding.
Mantis is a Vision-Language-Action (VLA) model designed to address the computational and representational challenges of integrating vision, language, and action for multimodal, instruction-grounded robotic control. Distinguished by its Disentangled Visual Foresight (DVF) module and explicit separation of future-frame prediction from multimodal instruction reasoning, Mantis advances the state of the art in VLA systems, achieving higher generalization, reasoning capability, and sample efficiency than entangled approaches (Yang et al., 20 Nov 2025).
1. System Architecture
Mantis consists of three principal subsystems: a vision-language backbone (“P”), a Disentangled Visual Foresight (DVF) head implemented as a Diffusion Transformer (“DiT”) with meta (latent-action) queries, and an action head with learnable action queries. Information flow follows:
- Inputs: At time t, the model receives a raw image oₜ and a tokenized language instruction l.
- Meta Queries [LAT]: Appended to the input token sequence, these enable the model to extract latent action representations relevant for dynamics.
- Backbone (P): Processes visual and language tokens via interleaved cross-attention, producing a fused embedding hₜ.
- DVF Head (DiT): Receives hₜ (via a linear connector C(·)) and a residual connection from oₜ; tasked with predicting the future frame oₜ₊ₙ.
- Action Head: Operates on hₜ with the [LAT] and explicit [ACT] queries, generating the action sequence a.
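The token routing above can be sketched schematically. The snippet below is a minimal, non-authoritative NumPy illustration: the backbone stand-in, hidden width, and query counts are all toy assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                      # toy hidden width (the real backbone is Qwen2.5-VL sized)
N_IMG, N_LANG = 4, 3       # visual / language token counts (illustrative)
N_LAT, N_ACT = 2, 2        # [LAT] meta queries and [ACT] action queries

def backbone(tokens):
    """Stand-in for the VLM backbone P(.): any sequence-to-sequence map."""
    return np.tanh(tokens)

img = rng.normal(size=(N_IMG, D))    # tokens of the current frame o_t
lang = rng.normal(size=(N_LANG, D))  # tokenized instruction l
lat = rng.normal(size=(N_LAT, D))    # learnable [LAT] queries
act = rng.normal(size=(N_ACT, D))    # learnable [ACT] queries

# Queries are appended to the multimodal token stream, then fused by P.
seq = np.concatenate([img, lang, lat, act])
h = backbone(seq)                    # fused embedding h_t

# Output routing: [LAT] slots condition the DiT head, [ACT] slots feed the action head.
h_lat = h[N_IMG + N_LANG : N_IMG + N_LANG + N_LAT]
h_act = h[-N_ACT:]
```

The key structural point is that the same fused sequence hₜ is read from different slot positions by the two downstream heads.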
Block Diagram
```
┌────────────────────────────┐
│ Language tokens (l)        │
│ Current frame (oₜ)         │
│ Latent-action queries [LAT]│
└─────────────┬──────────────┘
              │
              ▼
       ┌─────────────┐
       │ VLM backbone│
       │    P(·)     │
       └──────┬──────┘
              │ hₜ
       ┌──────┴─────────────┐
       │                    │
       ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│ Conn. C(·)  │    │ Action head π(·)│
│ + res. oₜ   │    │ ([LAT],[ACT])   │
└──────┬──────┘    └────────┬────────┘
       │                    │
       ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│ DiT head    │    │ Explicit actions│
│ (predicts   │    │        a        │
│  oₜ₊ₙ)      │    └─────────────────┘
└─────────────┘
```
2. Disentangled Visual Foresight (DVF) Mechanism
Conventional VLA approaches that train the backbone to output high-dimensional future visual states entangle perception, reasoning, prediction, and action, resulting in heavy computational overhead and diminished language reasoning. Mantis introduces disentanglement by isolating future-frame prediction (DVF) from the multimodal backbone, thus:
- Meta Queries [LAT]: Learn token-wise representations capturing inter-frame dynamics, initialized as learnable embeddings with the backbone's hidden dimension. These are appended to P's token stream.
- DVF Next-State Objective: Uses a Denoising Diffusion Probabilistic Model (DDPM) loss over the VAE-encoded latent z₀ of the future frame oₜ₊ₙ, L_DVF = E_{z₀,ε,k} ‖ε − ε_θ(z_k, k, cond)‖², with cond = C(hₜ), where oₜ is also injected into the DiT via a residual linear projection.
- Residual Connection: Injecting oₜ directly into the DiT conditioning avoids redundancy by sparing the DiT from reconstructing static content, letting [LAT] specialize in motion.
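The DVF objective is the standard DDPM noise-prediction loss applied to the future-frame latent. A toy training-step sketch follows; the linear beta schedule and the dummy denoiser are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 30                               # DVF uses 30 diffusion steps per the text
betas = np.linspace(1e-4, 0.02, K)   # assumed linear beta schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

def dvf_loss(z0, cond, eps_theta):
    """One DDPM training step: predict the noise added to the future-frame latent."""
    k = int(rng.integers(K))                      # random diffusion timestep
    eps = rng.normal(size=z0.shape)               # target noise
    z_k = np.sqrt(alpha_bar[k]) * z0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return float(np.mean((eps - eps_theta(z_k, k, cond)) ** 2))

z0 = rng.normal(size=(16,))          # toy VAE latent of o_{t+n}
cond = rng.normal(size=(16,))        # stand-in for C(h_t) (+ residual o_t projection)
loss = dvf_loss(z0, cond, lambda z, k, c: np.zeros_like(z))   # dummy denoiser
```

In training, eps_theta would be the DiT head; here a zero-predictor stands in so the loss reduces to the mean squared noise.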
3. Language Integration and Multimodal Representation
Mantis employs Qwen2.5-VL as its vision-language backbone, ensuring robust fusion of text and image information; in each transformer block, language and image tokens interact through cross-attention. This multimodal fusion preserves language reasoning even as the architecture scales: DVF offloads dynamics modeling from the backbone, which thus retains its capacity for comprehension and instruction grounding.
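Cross-attention of this kind can be sketched generically (single head, no learned projections; this is the textbook formulation, not the backbone's exact implementation): language-token queries attend over image-token keys and values.

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Scaled dot-product cross-attention: q_tokens attend over kv_tokens."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over kv positions
    return weights @ kv_tokens

rng = np.random.default_rng(2)
lang = rng.normal(size=(3, 8))    # language tokens
img = rng.normal(size=(5, 8))     # image tokens
fused = cross_attention(lang, img)
```

Each output row is a convex combination of image tokens, weighted by relevance to the corresponding language token.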
4. Training Procedures and Objectives
Mantis follows a multi-stage pretraining and finetuning pipeline:
- Stage 1 (Vision Only): Train DVF head and [LAT]/[GAP] queries on large human-manipulation datasets (SSV2), with backbone weights frozen. Random frame gaps (1–6) encourage diverse temporal reasoning.
- Stage 2 (Vision + Action): Trains on the DROID dataset (robotic video–action data); unfreezes the action queries and optimizes a joint loss combining the DVF diffusion objective with the action-prediction objective.
- Stage 3 (Multimodal, Add Language): Joint training on image–text and DROID data with an unfrozen backbone, optimizing the full multimodal objective. Typical optimizer: AdamW with weight decay 0.1, gradient-norm clipping at 0.5, and cosine LR schedules. Batch sizes and data augmentations follow DeepSpeed conventions, e.g., center-crop to 512×512.
The diffusion schedule follows cosine annealing, with 30 denoising steps for DVF and 10 for actions. Pretraining incorporates up to 38 image–text corpora for robust language grounding.
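A cosine learning-rate schedule of the kind described can be written as below; the warmup length and base rate are illustrative parameters, since the source does not specify them.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine-annealed learning rate with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With warmup_steps=0 the rate starts at base_lr and anneals smoothly to zero at total_steps; the same shape applies whether the unit is steps or epochs.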
5. Benchmark Results and Quantitative Performance
Performance of Mantis on the LIBERO benchmark, a diverse suite of instruction-following and manipulation tasks, is summarized as follows (post-finetuning, 30 epochs, no language loss):
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UnifiedVLA | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| Mantis | 98.8 | 99.2 | 94.4 | 94.2 | 96.7 |
- Mantis demonstrates the highest average success rate (96.7%) and exceeds 90% on the Spatial suite in approximately 5 epochs, over 2× faster convergence than UnifiedVLA.
- In real-world Agilex robot evaluations, success on in-domain instructions is 8.5/10 for Mantis vs 7.9/10 for the baseline; on out-of-domain (OOD) instructions, 7.0/10 vs 2.8/10, indicating superior generalization and reasoning (Yang et al., 20 Nov 2025).
6. Implications, Strengths, and Limitations
Disentanglement of visual foresight in Mantis directly addresses the “information bottleneck” and distributed capacity problems inherent in end-to-end VLA models. By shifting next-frame prediction to a diffusion-based head, the backbone is able to devote modeling power to instruction following and semantic reasoning, empirically verified on VQA/OCRBench (small drop from base Qwen2.5-VL):
- Convergence: Mantis achieves sample efficiency gains, converging >2× faster than entangled visual-foresight models.
- Generalization: Marked improvement in handling OOD instructions, compositional reasoning, and instruction specificity (e.g., "put the bear on (3+5)").
- Computational Efficiency: The released Adaptive Temporal Ensemble (ATE) variant reduces inference cost by 50% (Yang et al., 20 Nov 2025).
- Limitations: As with all data-driven VLA models, success is bounded by pretraining coverage and quality of action labeling. The explicit isolation of dynamics from vision-language reasoning may reduce interpretability of latent-action encoding.
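The source does not detail ATE's weighting scheme. As background, a plain (non-adaptive) temporal ensemble over overlapping predicted action chunks — the mechanism an adaptive variant would presumably reweight — can be sketched as:

```python
import numpy as np

def temporal_ensemble(chunks, t):
    """Average every still-valid prediction for timestep t.

    chunks: dict mapping the timestep a chunk was issued at -> (H, A) action array;
    a chunk issued at t0 covers timesteps t0 .. t0+H-1.
    """
    preds = [
        chunk[t - t0]
        for t0, chunk in chunks.items()
        if 0 <= t - t0 < chunk.shape[0]
    ]
    return np.mean(preds, axis=0)

# Two overlapping horizon-3 chunks for a 1-D action space.
chunks = {0: np.array([[0.0], [1.0], [2.0]]),
          1: np.array([[1.5], [2.5], [3.5]])}
a_t1 = temporal_ensemble(chunks, 1)   # averages chunks[0][1] and chunks[1][0]
```

Because each executed action pools several predictions, the policy can re-query the model less often, which is one route to the kind of inference savings reported for ATE.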
7. Relation to Other VLA Systems and Future Directions
Mantis distinguishes itself from frameworks such as UAV-VLA (Sautenkov et al., 9 Jan 2025) in both architecture and task domain. UAV-VLA targets large-scale aerial mission planning via satellite imagery, integrating zero-shot GPT goal extraction, visual localization (Molmo-7B-D), and geometric path/action generation. In contrast, Mantis is focused on grounded manipulation and mobile robotics, emphasizing instruction-following, generalization and rapid convergence through architectural disentanglement.
A plausible implication is that the separation of foresight and reasoning components could be generalized to other high-dimensional control domains, mitigating information bottlenecks and improving reasoning in VLA systems broadly. The publicly released codebase and weights (including ATE) support reproducibility and extension.
Continued research may explore further variations on meta-query representations, cross-modal scaling, and real-world safety/robustness under domain shift.