
Mantis VLA: Vision-Language-Action Framework

Updated 5 February 2026
  • Mantis VLA is a multimodal framework that integrates visual, language, and action information for instruction-grounded robotic control.
  • Its Disentangled Visual Foresight (DVF) module decouples next-frame prediction from reasoning, enabling rapid convergence and improved sample efficiency.
  • The design leverages a diffusion transformer and meta queries, enhancing prediction accuracy and computational efficiency while maintaining robust language grounding.

Mantis is a Vision-Language-Action (VLA) model designed to address the computational and representational challenges of integrating vision, language, and action for multimodal, instruction-grounded robotic control. Distinguished by its Disentangled Visual Foresight (DVF) module and its explicit separation of future prediction from multimodal instruction reasoning, Mantis advances the state of the art in VLA systems, achieving higher generalization, stronger reasoning, and better sample efficiency than entangled approaches (Yang et al., 20 Nov 2025).

1. System Architecture

Mantis consists of three principal subsystems: a vision-language backbone (“P”), a Disentangled Visual Foresight (DVF) head implemented as a Diffusion Transformer (“DiT”) with meta (latent-action) queries, and an action head with learnable action queries. Information flows as follows:

  • Inputs: At time t, the model receives a raw image o_t and a tokenized language instruction l.
  • Meta Queries [LAT]: Appended to the input token sequence, these enable the model to extract latent action representations relevant for dynamics.
  • Backbone (P): Processes visual and language tokens via interleaved cross-attention, producing a fused embedding h_t.
  • DVF Head (DiT): Receives h_t (via a linear connector C) and a residual connection from o_t; tasked with next-frame prediction.
  • Action Head: Operates on h_t with [LAT] and explicit [ACT] queries, generating the action sequence a_{t:t+n}.
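The information flow above can be sketched end to end with placeholder linear layers. This is a minimal illustration, not the paper's implementation: the hidden dimension, token counts, and the 8-step, 7-DoF action chunk are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                       # hidden dim (illustrative, not the paper's value)
N_VIS, N_TXT = 16, 8         # visual / language token counts (assumed)
N_LAT, N_ACT = 4, 4          # [LAT] meta-query and [ACT] query counts (assumed)

def linear(x, d_out, seed):
    """Stand-in for a learned linear layer."""
    w = np.random.default_rng(seed).normal(0, 0.02, (x.shape[-1], d_out))
    return x @ w

# Inputs at time t: image tokens for o_t and instruction tokens for l
vis = rng.normal(size=(N_VIS, D))
txt = rng.normal(size=(N_TXT, D))
lat_q = rng.normal(size=(N_LAT, D))      # learnable [LAT] queries
act_q = rng.normal(size=(N_ACT, D))      # learnable [ACT] queries

# Backbone P: placeholder mixing over the concatenated token stream
tokens = np.concatenate([vis, txt, lat_q, act_q], axis=0)
h = linear(tokens, D, seed=1)            # fused embedding h_t

# Connector C feeds the DiT head; a residual path re-injects o_t features
cond = (linear(h.mean(axis=0, keepdims=True), D, seed=2)
        + linear(vis.mean(axis=0, keepdims=True), D, seed=3))
# (the DiT would denoise latents of o_{t+n} conditioned on `cond`; omitted)

# Action head reads the [LAT]+[ACT] slice of h_t and emits an action chunk
n_steps, dof = 8, 7                      # chunk length and DoF (assumed)
act_feats = h[-(N_LAT + N_ACT):].reshape(1, -1)
actions = linear(act_feats, n_steps * dof, seed=4).reshape(n_steps, dof)
print(actions.shape)                     # (8, 7)
```

The point of the sketch is the routing: the DVF head sees only the connector output plus the o_t residual, while the action head reads the [LAT]/[ACT] positions of h_t.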

Block Diagram

┌────────────────────────────┐
│  Language tokens (l)       │
│  Current frame (oₜ)        │
│  Latent-action queries     │
│  [LAT]                     │
└─────────┬──────────────────┘
          │
          ▼
 ┌─────────────┐
 │  VLM        │
 │  backbone   │
 │  P(·)       │
 └─────────────┘
          │ hₜ
     ┌────┴──────┐
     │           │
     ▼           ▼
┌─────────────┐  ┌──────────────────┐
│ Conn. C(·)  │  │  Action head π(·)│
│ + res. oₜ   │  │  ([LAT],[ACT])   │
└─────┬───────┘  └──────────────────┘
      │                         │
      ▼                         ▼
┌────────────┐           ┌───────────────┐
│ DiT Head   │           │ Explicit      │
│ (predicts  │           │ actions       │
│  oₜ₊ₙ)     │           │  a            │
└────────────┘           └───────────────┘

2. Disentangled Visual Foresight (DVF) Mechanism

Conventional VLA approaches that train the backbone to output high-dimensional future visual states entangle perception, reasoning, prediction, and action, resulting in heavy computational overhead and diminished language reasoning. Mantis introduces disentanglement by isolating future-frame prediction (DVF) from the multimodal backbone, thus:

  • Meta Queries [LAT]: Learn token-wise representations capturing inter-frame dynamics, initialized as Q_\text{meta} \in \mathbb{R}^{N_\text{lat} \times D} with D the hidden dimension. These are appended to P’s token stream.
  • DVF Next-State Objective: Uses a Denoising Diffusion Probabilistic Model (DDPM) loss over VAE-encoded latent codes x_0 of the future frame o_{t+n}:

z_t = \alpha_t x_0 + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_\text{DVF} = \mathbb{E}_{x_0, \epsilon, t}\left[ \| \epsilon - \epsilon_\theta(z_t, t, \text{cond}) \|^2 \right]

with \text{cond} = C(o_t, h_t), where o_t is also injected into the DiT via a residual linear projection.

  • Residual Connection: u = \phi_\text{cond}(C(o_t, h_t)) + \psi_\text{skip}(E_\text{vae}(o_t)); this avoids redundancy by preventing the DiT from reconstructing static content, letting [LAT] specialize in motion.
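The noising step and DVF loss above can be sketched numerically. The cosine noise schedule and the zero-predicting stand-in denoiser are assumptions for illustration; the real \epsilon_\theta is the DiT conditioned on C(o_t, h_t).

```python
import numpy as np

rng = np.random.default_rng(0)

# Cosine noise schedule (assumed; the paper's exact schedule may differ)
T = 30                                    # DVF diffusion steps per the text
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = f / f[0]                      # cumulative signal fraction

def add_noise(x0, t, eps):
    """Forward process: z_t = alpha_t * x0 + sigma_t * eps."""
    a = np.sqrt(alpha_bar[t])
    sigma = np.sqrt(1.0 - alpha_bar[t])
    return a * x0 + sigma * eps

x0 = rng.normal(size=(4, 8))              # VAE latent of the future frame o_{t+n}
eps = rng.normal(size=x0.shape)
t = 15
z_t = add_noise(x0, t, eps)

def eps_theta(z, t, cond):
    """Stand-in denoiser; the real model is a DiT conditioned on C(o_t, h_t)."""
    return z * 0.0                        # placeholder: predicts zero noise

cond = rng.normal(size=(4, 8))
loss = np.mean((eps - eps_theta(z_t, t, cond)) ** 2)  # L_DVF (MSE on noise)
print(loss > 0)                           # True
```

Training minimizes this MSE between the injected noise and the denoiser's prediction, averaged over random x_0, \epsilon, and t.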

3. Language Integration and Multimodal Representation

Mantis employs Qwen2.5-VL as its vision-language backbone, ensuring robust fusion of text and image information. In each transformer block, language and image tokens interact through cross-attention:

Q=WQhtext,K=WKhvis,V=WVhvisQ = W_Q h_\text{text},\quad K = W_K h_\text{vis},\quad V = W_V h_\text{vis}

Attn(Q,K,V)=softmax(QK/d)V\text{Attn}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d})V

This multimodal fusion preserves language reasoning even as the architecture scales, as DVF offloads dynamics modeling from the backbone, which thus retains its capacity for comprehension and instruction grounding.
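The cross-attention equations above translate directly into code; the dimension and token counts here are arbitrary, and the projection weights are random stand-ins for the learned W_Q, W_K, W_V.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                  # per-head dimension (illustrative)
n_text, n_vis = 6, 10                   # token counts (illustrative)

h_text = rng.normal(size=(n_text, d))   # language-token hidden states
h_vis = rng.normal(size=(n_vis, d))     # image-token hidden states

W_Q, W_K, W_V = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Queries from language tokens; keys/values from visual tokens, as above
Q, K, V = h_text @ W_Q, h_vis @ W_K, h_vis @ W_V
attn = softmax(Q @ K.T / np.sqrt(d)) @ V
print(attn.shape)   # (6, 32): one vision-informed vector per language token
```

Each language token thus attends over all visual tokens, which is what lets instruction grounding survive even as the backbone is relieved of dynamics modeling.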

4. Training Procedures and Objectives

Mantis follows a multi-stage pretraining and finetuning pipeline:

  • Stage 1 (Vision Only): Train the DVF head and [LAT]/[GAP] queries on large human-manipulation datasets (SSV2), with backbone weights frozen. Random frame gaps (1–6) encourage diverse temporal reasoning.
  • Stage 2 (Vision + Action): DROID dataset (robotic video–action data); unfreezes the action queries and optimizes the joint loss \alpha \mathcal{L}_\text{DVF} + \mathcal{L}_\text{action}.
  • Stage 3 (Multimodal, Add Language): Joint training on image–text and DROID data with an unfrozen backbone, optimizing \alpha \mathcal{L}_\text{DVF} + \mathcal{L}_\text{action} + \beta \mathcal{L}_\text{lang}. Typical optimizer: AdamW with weight decay 0.1, gradient-norm clip 0.5, and a cosine LR schedule (10^{-4} \to 10^{-5}). Batch sizes and data augmentations follow DeepSpeed conventions, e.g., center-crop to 512×512.

The diffusion schedule follows cosine annealing, with 30 steps for DVF and 10 for actions. Backbones incorporate up to 38 image–text corpora for robust language grounding.
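A minimal sketch of the Stage-3 recipe, assuming a standard cosine decay between the stated endpoints and treating \alpha and \beta simply as scalar loss weights (the combination rule is from the text; everything else here is generic):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max to lr_min, per the Stage-3 schedule above."""
    t = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def stage3_loss(l_dvf, l_action, l_lang, alpha=0.1, beta=1.0):
    """Joint objective: alpha * L_DVF + L_action + beta * L_lang.

    alpha = 0.1 matches the LIBERO finetuning setup quoted later in the
    article; beta is an assumed placeholder value.
    """
    return alpha * l_dvf + l_action + beta * l_lang

print(cosine_lr(0, 1000))      # 1e-4 at the start of training
print(cosine_lr(1000, 1000))   # 1e-5 at the end
```

Per-stage freezing then amounts to excluding the backbone's (Stage 1–2) or including all (Stage 3) parameter groups in the optimizer.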

5. Benchmark Results and Quantitative Performance

Performance of Mantis on the LIBERO benchmark, a diverse suite of instruction-following and manipulation tasks, is summarized below (post-finetuning, 30 epochs, \alpha = 0.1, no language loss):

  Method             Spatial   Object   Goal   Long   Avg.
  Diffusion Policy     78.3     92.5    68.3   50.5   72.4
  OpenVLA              84.7     88.4    79.2   53.7   76.5
  π₀                   96.8     98.8    95.8   85.2   94.2
  CoT-VLA              87.5     91.6    87.6   69.0   81.1
  UnifiedVLA           95.4     98.8    93.6   94.0   95.5
  ƒ₁                   98.2     97.8    95.4   91.3   95.7
  Mantis               98.8     99.2    94.4   94.2   96.7
  • Mantis demonstrates the highest average success rate (96.7%) and exceeds 90% on the Spatial suite in approximately 5 epochs (over 2× faster convergence than UnifiedVLA).
  • In real-world Agilex robot evaluations, success on in-domain instructions is 8.5/10 (Mantis) vs 7.9/10 (π₀.₅); on out-of-domain (OOD) instructions, 7.0/10 vs 2.8/10, indicating superior generalization and reasoning (Yang et al., 20 Nov 2025).

6. Implications, Strengths, and Limitations

Disentangling visual foresight in Mantis directly addresses the “information bottleneck” and distributed-capacity problems inherent in end-to-end VLA models. By shifting next-frame prediction to a diffusion-based head, the backbone can devote its modeling capacity to instruction following and semantic reasoning, as verified empirically on VQA/OCRBench (only a small drop from the base Qwen2.5-VL):

  • Convergence: Mantis achieves sample efficiency gains, converging >2× faster than entangled visual-foresight models.
  • Generalization: Marked improvement in handling OOD instructions, compositional reasoning, and instruction specificity ("put the bear on (3+5)").
  • Computational Efficiency: Released Adaptive Temporal Ensemble (ATE) variant reduces inference by 50% (Yang et al., 20 Nov 2025).
  • Limitations: As with all data-driven VLA models, success is bounded by pretraining coverage and quality of action labeling. The explicit isolation of dynamics from vision-language reasoning may reduce interpretability of latent-action encoding.
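The internals of the ATE variant mentioned above are not reproduced here; as a rough illustration of what temporal ensembling over action chunks involves, the sketch below averages overlapping chunk predictions for the current timestep with an assumed exponential age-weighting (in the style of ACT-like ensembling, not Mantis's actual algorithm).

```python
import numpy as np

def temporal_ensemble(chunks, m=0.1):
    """Blend overlapping action-chunk predictions for the current timestep.

    `chunks` maps the step a chunk was predicted at -> its action array of
    shape (horizon, dof). Older predictions get exponentially smaller weight
    w = exp(-m * age). This weighting scheme is an assumption for
    illustration, not the published ATE method.
    """
    now = max(chunks)
    acts, weights = [], []
    for t0, chunk in chunks.items():
        age = now - t0
        if age < len(chunk):              # chunk still covers the current step
            acts.append(chunk[age])
            weights.append(np.exp(-m * age))
    w = np.array(weights) / np.sum(weights)
    return (np.stack(acts) * w[:, None]).sum(axis=0)

# Three overlapping 4-step, 2-DoF chunks predicted at steps 0, 1, 2
chunks = {0: np.ones((4, 2)), 1: 2 * np.ones((4, 2)), 2: 3 * np.ones((4, 2))}
a = temporal_ensemble(chunks)
print(a.shape)   # (2,): one blended action for the current step
```

Ensembling of this kind lets a policy emit chunks less often while still acting every step, which is one plausible route to the reported inference savings.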

7. Relation to Other VLA Systems and Future Directions

Mantis distinguishes itself from frameworks such as UAV-VLA (Sautenkov et al., 9 Jan 2025) in both architecture and task domain. UAV-VLA targets large-scale aerial mission planning via satellite imagery, integrating zero-shot GPT goal extraction, visual localization (Molmo-7B-D), and geometric path/action generation. In contrast, Mantis is focused on grounded manipulation and mobile robotics, emphasizing instruction-following, generalization and rapid convergence through architectural disentanglement.

A plausible implication is that the separation of foresight and reasoning components could be generalized to other high-dimensional control domains, mitigating information bottlenecks and improving reasoning in VLA systems broadly. The publicly released codebase and weights (including ATE) support reproducibility and extension.

Continued research may explore further variations on meta-query representations, cross-modal scaling, and real-world safety/robustness under domain shift.
