Mantis VLA: Vision-Language-Action Framework
- Mantis VLA is a multimodal framework that integrates visual, language, and action information for instruction-grounded robotic control.
- Its Disentangled Visual Foresight (DVF) module decouples next-frame prediction from reasoning, enabling rapid convergence and improved sample efficiency.
- The design leverages a diffusion transformer and meta queries, enhancing prediction accuracy and computational efficiency while maintaining robust language grounding.
Mantis is a Vision-Language-Action (VLA) model designed to address the computational and representational challenges of integrating vision, language, and action for multimodal, instruction-grounded robotic control. Distinguished by its Disentangled Visual Foresight (DVF) module and explicit separation of future-frame prediction from multimodal instruction reasoning, Mantis advances the state of the art in VLA systems, achieving higher generalization, reasoning capability, and sample efficiency than entangled approaches (Yang et al., 20 Nov 2025).
1. System Architecture
Mantis consists of three principal subsystems: a vision-language backbone (“P”), a Disentangled Visual Foresight (DVF) head implemented as a Diffusion Transformer (“DiT”) with meta (latent-action) queries, and an action head with learnable action queries. Information flow follows:
- Inputs: At time t, the model receives a raw image oₜ and a tokenized language instruction l.
- Meta Queries [LAT]: Appended to the input token sequence, these enable the model to extract latent action representations relevant for dynamics.
- Backbone (P): Processes visual and language tokens via interleaved cross-attention, producing a fused embedding hₜ.
- DVF Head (DiT): Receives hₜ (via a linear connector C(·)) and a residual connection from oₜ; tasked with predicting the future frame oₜ₊ₙ.
- Action Head: Operates on hₜ with the [LAT] and explicit [ACT] queries, generating the action sequence a.
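The token routing above can be sketched schematically. The snippet below is a minimal, non-authoritative NumPy illustration: the backbone stand-in, hidden width, and query counts are all toy assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                      # toy hidden width (the real backbone is Qwen2.5-VL sized)
N_IMG, N_LANG = 4, 3       # visual / language token counts (illustrative)
N_LAT, N_ACT = 2, 2        # [LAT] meta queries and [ACT] action queries

def backbone(tokens):
    """Stand-in for the VLM backbone P(.): any sequence-to-sequence map."""
    return np.tanh(tokens)

img = rng.normal(size=(N_IMG, D))    # tokens of the current frame o_t
lang = rng.normal(size=(N_LANG, D))  # tokenized instruction l
lat = rng.normal(size=(N_LAT, D))    # learnable [LAT] queries
act = rng.normal(size=(N_ACT, D))    # learnable [ACT] queries

# Queries are appended to the multimodal token stream, then fused by P.
seq = np.concatenate([img, lang, lat, act])
h = backbone(seq)                    # fused embedding h_t

# Output routing: [LAT] slots condition the DiT head, [ACT] slots feed the action head.
h_lat = h[N_IMG + N_LANG : N_IMG + N_LANG + N_LAT]
h_act = h[-N_ACT:]
```

The key structural point is that the same fused sequence hₜ is read from different slot positions by the two downstream heads.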
Block Diagram
```
┌────────────────────────────┐
│ Language tokens (l)        │
│ Current frame (oₜ)         │
│ Latent-action queries [LAT]│
└─────────────┬──────────────┘
              │
              ▼
       ┌─────────────┐
       │ VLM backbone│
       │    P(·)     │
       └──────┬──────┘
              │ hₜ
       ┌──────┴─────────────┐
       │                    │
       ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│ Conn. C(·)  │    │ Action head π(·)│
│ + res. oₜ   │    │ ([LAT],[ACT])   │
└──────┬──────┘    └────────┬────────┘
       │                    │
       ▼                    ▼
┌─────────────┐    ┌─────────────────┐
│ DiT head    │    │ Explicit actions│
│ (predicts   │    │        a        │
│  oₜ₊ₙ)      │    └─────────────────┘
└─────────────┘
```
2. Disentangled Visual Foresight (DVF) Mechanism
Conventional VLA approaches that train the backbone to output high-dimensional future visual states entangle perception, reasoning, prediction, and action, resulting in heavy computational overhead and diminished language reasoning. Mantis introduces disentanglement by isolating future-frame prediction (DVF) from the multimodal backbone, thus:
- Meta Queries [LAT]: Learn token-wise representations capturing inter-frame dynamics, initialized as learnable embeddings with the backbone's hidden dimension. These are appended to P's token stream.
- DVF Next-State Objective: Uses a Denoising Diffusion Probabilistic Model (DDPM) loss over the VAE-encoded latent z₀ of the future frame oₜ₊ₙ, L_DVF = E_{z₀,ε,k} ‖ε − ε_θ(z_k, k, cond)‖², with cond = C(hₜ), where oₜ is also injected into the DiT via a residual linear projection.
- Residual Connection: Injecting oₜ directly into the DiT conditioning avoids redundancy by sparing the DiT from reconstructing static content, letting [LAT] specialize in motion.
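The DVF objective is the standard DDPM noise-prediction loss applied to the future-frame latent. A toy training-step sketch follows; the linear beta schedule and the dummy denoiser are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 30                               # DVF uses 30 diffusion steps per the text
betas = np.linspace(1e-4, 0.02, K)   # assumed linear beta schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

def dvf_loss(z0, cond, eps_theta):
    """One DDPM training step: predict the noise added to the future-frame latent."""
    k = int(rng.integers(K))                      # random diffusion timestep
    eps = rng.normal(size=z0.shape)               # target noise
    z_k = np.sqrt(alpha_bar[k]) * z0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return float(np.mean((eps - eps_theta(z_k, k, cond)) ** 2))

z0 = rng.normal(size=(16,))          # toy VAE latent of o_{t+n}
cond = rng.normal(size=(16,))        # stand-in for C(h_t) (+ residual o_t projection)
loss = dvf_loss(z0, cond, lambda z, k, c: np.zeros_like(z))   # dummy denoiser
```

In training, eps_theta would be the DiT head; here a zero-predictor stands in so the loss reduces to the mean squared noise.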
3. Language Integration and Multimodal Representation
Mantis employs Qwen2.5-VL as its vision-language backbone, ensuring robust fusion of text and image information; in each transformer block, language and image tokens interact through cross-attention. This multimodal fusion preserves language reasoning even as the architecture scales: DVF offloads dynamics modeling from the backbone, which thus retains its capacity for comprehension and instruction grounding.
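Cross-attention of this kind can be sketched generically (single head, no learned projections; this is the textbook formulation, not the backbone's exact implementation): language-token queries attend over image-token keys and values.

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Scaled dot-product cross-attention: q_tokens attend over kv_tokens."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over kv positions
    return weights @ kv_tokens

rng = np.random.default_rng(2)
lang = rng.normal(size=(3, 8))    # language tokens
img = rng.normal(size=(5, 8))     # image tokens
fused = cross_attention(lang, img)
```

Each output row is a convex combination of image tokens, weighted by relevance to the corresponding language token.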
4. Training Procedures and Objectives
Mantis follows a multi-stage pretraining and finetuning pipeline:
- Stage 1 (Vision Only): Train DVF head and [LAT]/[GAP] queries on large human-manipulation datasets (SSV2), with backbone weights frozen. Random frame gaps (1–6) encourage diverse temporal reasoning.
- Stage 2 (Vision + Action): Trains on the DROID dataset (robotic video–action data); unfreezes the action queries and optimizes a joint loss combining the DVF diffusion objective with the action-prediction objective.
- Stage 3 (Multimodal, Add Language): Joint training on image–text and DROID data with an unfrozen backbone, optimizing the full multimodal objective. Typical optimizer: AdamW with weight decay 0.1, gradient-norm clipping at 0.5, and cosine LR schedules. Batch sizes and data augmentations follow DeepSpeed conventions, e.g., center-crop to 512×512.
The diffusion schedule follows cosine annealing, with 30 denoising steps for DVF and 10 for actions. Pretraining incorporates up to 38 image–text corpora for robust language grounding.
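A cosine learning-rate schedule of the kind described can be written as below; the warmup length and base rate are illustrative parameters, since the source does not specify them.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine-annealed learning rate with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With warmup_steps=0 the rate starts at base_lr and anneals smoothly to zero at total_steps; the same shape applies whether the unit is steps or epochs.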
5. Benchmark Results and Quantitative Performance
Performance of Mantis on the LIBERO benchmark, a diverse suite of instruction-following and manipulation tasks, is summarized as follows (post-finetuning, 30 epochs, no language loss):
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UnifiedVLA | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| Mantis | 98.8 | 99.2 | 94.4 | 94.2 | 96.7 |
- Mantis demonstrates the highest average success rate (96.7%) and exceeds 90% on the Spatial suite in approximately 5 epochs, over 2× faster convergence than UnifiedVLA.
- In real-world Agilex robot evaluations, success on in-domain instructions is 8.5/10 for Mantis vs 7.9/10 for the baseline; on out-of-domain (OOD) instructions, 7.0/10 vs 2.8/10, indicating superior generalization and reasoning (Yang et al., 20 Nov 2025).
6. Implications, Strengths, and Limitations
Disentanglement of visual foresight in Mantis directly addresses the “information bottleneck” and distributed capacity problems inherent in end-to-end VLA models. By shifting next-frame prediction to a diffusion-based head, the backbone is able to devote modeling power to instruction following and semantic reasoning, empirically verified on VQA/OCRBench (small drop from base Qwen2.5-VL):
- Convergence: Mantis achieves sample efficiency gains, converging >2× faster than entangled visual-foresight models.
- Generalization: Marked improvement in handling OOD instructions, compositional reasoning, and instruction specificity (e.g., "put the bear on (3+5)").
- Computational Efficiency: The released Adaptive Temporal Ensemble (ATE) variant reduces inference cost by 50% (Yang et al., 20 Nov 2025).
- Limitations: As with all data-driven VLA models, success is bounded by pretraining coverage and quality of action labeling. The explicit isolation of dynamics from vision-language reasoning may reduce interpretability of latent-action encoding.
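The source does not detail ATE's weighting scheme. As background, a plain (non-adaptive) temporal ensemble over overlapping predicted action chunks — the mechanism an adaptive variant would presumably reweight — can be sketched as:

```python
import numpy as np

def temporal_ensemble(chunks, t):
    """Average every still-valid prediction for timestep t.

    chunks: dict mapping the timestep a chunk was issued at -> (H, A) action array;
    a chunk issued at t0 covers timesteps t0 .. t0+H-1.
    """
    preds = [
        chunk[t - t0]
        for t0, chunk in chunks.items()
        if 0 <= t - t0 < chunk.shape[0]
    ]
    return np.mean(preds, axis=0)

# Two overlapping horizon-3 chunks for a 1-D action space.
chunks = {0: np.array([[0.0], [1.0], [2.0]]),
          1: np.array([[1.5], [2.5], [3.5]])}
a_t1 = temporal_ensemble(chunks, 1)   # averages chunks[0][1] and chunks[1][0]
```

Because each executed action pools several predictions, the policy can re-query the model less often, which is one route to the kind of inference savings reported for ATE.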
7. Relation to Other VLA Systems and Future Directions
Mantis distinguishes itself from frameworks such as UAV-VLA (Sautenkov et al., 9 Jan 2025) in both architecture and task domain. UAV-VLA targets large-scale aerial mission planning via satellite imagery, integrating zero-shot GPT goal extraction, visual localization (Molmo-7B-D), and geometric path/action generation. In contrast, Mantis is focused on grounded manipulation and mobile robotics, emphasizing instruction-following, generalization and rapid convergence through architectural disentanglement.
A plausible implication is that the separation of foresight and reasoning components could be generalized to other high-dimensional control domains, mitigating information bottlenecks and improving reasoning in VLA systems broadly. The publicly released codebase and weights (including ATE) support reproducibility and extension.
Continued research may explore further variations on meta-query representations, cross-modal scaling, and real-world safety/robustness under domain shift.