
OneTwoVLA: Unified VLA Model

Updated 11 November 2025
  • OneTwoVLA is a unified vision-language-action model that integrates visual input, language, and proprioception using a single transformer backbone.
  • It employs a learned gating mechanism to dynamically switch between explicit reasoning and rapid action, minimizing latency during task execution.
  • The model is co-trained on curated robot demonstrations and synthetic VL data, achieving superior long-horizon planning and error recovery in complex manipulation tasks.

OneTwoVLA is a unified vision-language-action (VLA) model designed to endow general-purpose robots with tightly integrated reasoning and acting capabilities. Unlike dual-system pipelines that rigidly separate high-level reasoning from low-level primitive control—often leading to latency and mutual understanding bottlenecks—OneTwoVLA features a single transformer backbone that adaptively interleaves explicit reasoning with fast action generation. This architecture is supported by a scalable pipeline for synthesizing embodied, reasoning-centric VL data co-trained with robot demonstrations, imparting strong generalization properties across long-horizon planning, real-time error recovery, interaction, and visual grounding. OneTwoVLA’s mode switching is governed by a learned gating mechanism that dynamically determines when to trigger deep reasoning versus action execution, enabling efficient, context-aware manipulation spanning tasks such as hotpot cooking and cocktail mixing.

1. Unified Model Architecture

OneTwoVLA employs a single auto-regressive transformer backbone—instantiated by extending the π₀ model—that processes a multimodal input stream. At each timestep t, the model jointly ingests:

  • Multi-camera image observations I_t^{1:n}
  • Reference images I_ref^{1:n}, capturing critical visual context tied to the latest reasoning segment
  • Language instruction ℓ (inclusive of ongoing human–robot dialog)
  • Most recent reasoning content R (a short text summary)
  • Robot proprioception s_t

All modalities are linearly projected into embeddings and concatenated as a sequence input to a shared stack of transformer layers.
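The projection-and-concatenation step can be sketched as follows. This is a minimal illustration, not the released implementation: the embedding width, per-modality feature shapes, and token counts below are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding width (illustrative value, not from the paper)

# Toy stand-ins for the modalities listed above; all shapes are assumptions.
image_feats = rng.normal(size=(2 * 196, 768))  # two camera views, patch features
text_tokens = rng.normal(size=(12, 512))       # instruction + latest reasoning text
proprio = rng.normal(size=(1, 14))             # joint positions / gripper state

def project(x, d_out, rng):
    """Linearly project one modality into the shared d_model space."""
    w = rng.normal(size=(x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return x @ w

# Each modality gets its own linear projection, then everything is
# concatenated into one token sequence for the shared transformer stack.
sequence = np.concatenate(
    [project(image_feats, d_model, rng),
     project(text_tokens, d_model, rng),
     project(proprio, d_model, rng)],
    axis=0,
)
print(sequence.shape)  # (405, 64): 392 image + 12 text + 1 proprioception tokens
```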

Three output heads are attached to the transformer’s hidden state h_t at each step:

  1. Decision Head: Outputs a softmax over tokens {[BOR], [BOA]}, representing “begin reasoning” and “begin action.”
  2. Text-generation Head: Active in reasoning mode, generates natural language summaries token-wise.
  3. Action Head: Active in acting mode, produces a continuous action chunk A_t via flow-matching.

The learned gating mechanism g(x_t, h_t) (decision head output) determines which output head to activate, explicitly connecting high-level cognitive and low-level motor control in a single model.
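A single decoding step of this head-selection logic can be sketched as below. The weight matrices, hidden size, and vocabulary/action dimensions are placeholders, not the model's actual parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(h_t, w_dec, w_txt, w_act, threshold=0.5):
    """One decoding step: the decision head picks which output head runs.

    h_t: hidden state from the shared transformer backbone.
    w_*: per-head readout weights (random stand-ins here).
    Returns ("reason", text_logits) or ("act", action_chunk).
    """
    p = softmax(h_t @ w_dec)           # distribution over {[BOR], [BOA]}
    p_reason = p[0]
    if p_reason > threshold:           # g(x_t, h_t) = 1 -> reasoning mode
        return "reason", h_t @ w_txt   # token logits for the text head
    return "act", h_t @ w_act          # continuous action chunk (action head)

rng = np.random.default_rng(1)
h = rng.normal(size=64)
mode, out = step(h,
                 rng.normal(size=(64, 2)),    # decision head
                 rng.normal(size=(64, 100)),  # text head (toy vocab of 100)
                 rng.normal(size=(64, 7)))    # action head (7-DoF chunk)
print(mode, out.shape)
```

In the real model the action head uses flow-matching rather than a single linear readout; the point here is only that one shared hidden state feeds all three heads.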

2. Adaptive Reasoning and Acting

The core innovation is the adaptive scheduling of “when to think” and “when to act.” Unlike systems enforcing reasoning at every step (introducing large latencies) or never reasoning (yielding shallow policies), OneTwoVLA triggers explicit reasoning only at critical junctures such as subtask boundaries, error detection, or upon human requests.

Mathematically, at each step:

  • P_reason = P([BOR] | h_t)
  • P_act = P([BOA] | h_t)

Mode switching uses a gating function:

g(x_t, h_t) = 1 if P_θ([BOR] | h_t) > γ, and 0 otherwise

with γ typically set to 0.5.
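The threshold rule can be written as a one-line helper, using γ = 0.5 as stated:

```python
def gate(p_reason: float, gamma: float = 0.5) -> int:
    """g(x_t, h_t): 1 -> enter reasoning mode, 0 -> keep acting.

    Strict inequality: probabilities at exactly gamma fall through to acting.
    """
    return 1 if p_reason > gamma else 0

print(gate(0.8))  # 1: reasoning is triggered
print(gate(0.2))  # 0: keep acting
```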

When g = 1: The model enters System Two, generating a new reasoning summary R (chain-of-thought tokens), updating I_ref, and waiting for human input if necessary.

When g = 0: The model enters System One, generating the next low-level action chunk A_t.

Reasoning is invoked sparsely—typically 3–6 times per trial, resulting in 1–4 second interruptions per 2–4 minute manipulation run (less than 10% overhead).

3. Training Data and Learning Pipeline

A. Curated Robot Demonstrations

Tasks include multi-step manipulations (Tomato–Egg, Hotpot, Cocktail) and atomic skills (pick, place, open). Each demonstration trajectory is segmented into:

  • Reasoning intervals: Labeled with a four-part text R = {scene description, high-level plan, historical summary, next step}, optionally appending dialog to ℓ.
  • Acting intervals: Presented as supervised A_t chunks.
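The segmentation above suggests a simple record layout, sketched here with hypothetical field names (the released dataset schema, if any, is not specified in the text):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningInterval:
    """The four-part reasoning label R, plus optional dialog appended to ℓ."""
    scene_description: str
    high_level_plan: str
    historical_summary: str
    next_step: str
    dialog: str = ""

@dataclass
class ActingInterval:
    """Supervised action chunks A_t (e.g., end-effector deltas; illustrative)."""
    action_chunks: List[list]

@dataclass
class Demonstration:
    instruction: str
    segments: list = field(default_factory=list)  # alternating interval types

demo = Demonstration(instruction="Make the tomato-egg dish")
demo.segments.append(ReasoningInterval(
    "Pan and raw eggs on the counter", "1. crack eggs 2. stir-fry 3. plate",
    "nothing done yet", "pick up an egg"))
demo.segments.append(ActingInterval(action_chunks=[[0.01, 0.0, -0.02]]))
print(len(demo.segments))  # 2
```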

Data volume: approximately 3,000 trajectories for three tasks and 2,000 additional skill demonstrations.

B. Synthetic Vision-Language Data Generation

16,000 high-quality “embodied chain-of-thought” synthetic samples are created via a multi-stage process:

  1. A GPT-4-class model (Gemini 2.5 Pro) is prompted to generate roughly 100,000 diverse tabletop scene descriptions.
  2. Each is rendered into an image via FLUX.1-dev, applying random fisheye distortion and virtual robot gripper overlays for domain realism.
  3. Gemini 2.5 Pro is further prompted to generate, for each image:
    • 17 instruction–reasoning pairs (visual grounding tasks: direct reference, spatial relation, attribute, semantic), or
    • One long-horizon instruction and a stepwise plan.
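The per-image sample assembly in step 3 can be sketched as below. The generator calls themselves (Gemini 2.5 Pro, FLUX.1-dev) are replaced by placeholder strings; only the record structure is illustrated, and all field names are assumptions.

```python
# The four grounding task types named above.
GROUNDING_TYPES = ["direct reference", "spatial relation", "attribute", "semantic"]

def make_grounding_samples(image_id: str, n: int = 17):
    """Return n instruction-reasoning pairs for one rendered scene image."""
    samples = []
    for i in range(n):
        samples.append({
            "image": image_id,
            "kind": GROUNDING_TYPES[i % len(GROUNDING_TYPES)],
            "instruction": f"<instruction {i} for {image_id}>",  # placeholder
            "reasoning": f"<grounding reasoning {i}>",           # placeholder
        })
    return samples

def make_plan_sample(image_id: str):
    """Return one long-horizon instruction paired with a stepwise plan."""
    return {"image": image_id,
            "instruction": f"<long-horizon task for {image_id}>",
            "plan": ["<step 1>", "<step 2>", "<step 3>"]}

# An image contributes either 17 grounding pairs or 1 plan; shown together here.
batch = make_grounding_samples("scene_0001") + [make_plan_sample("scene_0001")]
print(len(batch))  # 18
```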

Breakdown: 6,000 images × 17 = 102,000 grounding samples and 10,000 images × 1 = 10,000 plans; 16,000 samples in total are used for training (subset selection described in the data).

C. Co-training Strategy and Objective

Training alternates between robot and synthetic VL data. The combined loss:

L_total = L_act + λ L_reason + μ L_dec

where

  • L_act: Flow-matching loss on continuous robot actions.
  • L_reason: Cross-entropy on reasoning text during reasoning intervals.
  • L_dec: Cross-entropy loss over the decision head.
  • λ ≈ 1; μ is kept small.
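A toy version of the combined objective is sketched below. The flow-matching term is stood in for by an MSE between predicted and target velocities, which is the usual form of that loss but is an assumption about this model's exact formulation; shapes and the μ = 0.1 value are illustrative.

```python
import numpy as np

def cross_entropy(logits, target):
    """Token-level cross-entropy, used for the reasoning and decision heads."""
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[target]

def flow_matching_loss(pred_velocity, target_velocity):
    """MSE between predicted and target flow velocities; stands in for L_act."""
    return np.mean((pred_velocity - target_velocity) ** 2)

def total_loss(pred_v, tgt_v, reason_logits, reason_tok, dec_logits, dec_tok,
               lam=1.0, mu=0.1):
    # L_total = L_act + λ L_reason + μ L_dec (μ kept small, per the text)
    return (flow_matching_loss(pred_v, tgt_v)
            + lam * cross_entropy(reason_logits, reason_tok)
            + mu * cross_entropy(dec_logits, dec_tok))

rng = np.random.default_rng(2)
loss = total_loss(rng.normal(size=7), rng.normal(size=7),   # action chunk
                  rng.normal(size=100), 3,                  # reasoning token
                  rng.normal(size=2), 1)                    # [BOR]/[BOA] target
print(float(loss) > 0)  # True
```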

4. Empirical Capabilities and Performance

The model is evaluated across four key robotic capabilities:

Capability                           OneTwoVLA (VL)   OneTwoVLA (robot only)   π₀ Baseline   Dual-System
Long-horizon Task Planning (avg)     87%              -                        57%           63%
Error Detection/Recovery (>80%)      Yes              -                        No            Latency
Human–Robot Interaction (success)    90%              -                        No Text Gen   Context Loss
Visual Grounding (Open-world)        73%              8%                       3–5%          -
  • Long-Horizon Planning: In three 4–6 step tasks (Tomato–Egg, Hotpot, Cocktail; 20 trials each), OneTwoVLA achieves 85–95% success, outperforming both the prior π₀ model and a dual-system baseline by 30pp and 24pp, respectively.
  • Error Detection and Recovery: On injected failures, OneTwoVLA auto-detects errors by activating the reasoning gate, plans corrections, and recovers in over 80% of trials. π₀ fails to react; the dual-system is too slow to recover.
  • Natural Human–Robot Interaction: In trials with mid-task human intervention, OneTwoVLA incorporates dialog into ℓ, enters reasoning, replans or asks for clarification, and achieves 90% task completion following user input. Dual-system approaches lose context in 60% of trials, and π₀ cannot engage in dialog.
  • Visual Grounding and Generalization: In single-env (known objects) and open-world settings (180 objects/8 scenes), co-training with synthetic VL data improves open-world visual grounding from 8% (robot only) to 73%, substantiating the benefit of large, domain-tailored VL corpora.

5. Concrete Applications

OneTwoVLA has been validated on dexterous, multi-step manipulation scenarios, including:

  • Hotpot Preparation: Sequential dipping, temporal waiting, strainer handling.
  • Cocktail Mixing: Multi-liquid pouring, tool use (e.g., making Mountain Fuji with vodka, Blue Curacao, lemon juice, yogurt).
  • Generalization Tasks: Fetching specified objects (“icy cola” from fridge), clearing compound dishes prior to plate passing, out-of-reach retrieval using tools, mood-driven recipe planning.

These applications span precise manipulation, visual grounding in clutter, real-time adaptation, and dialog-driven replanning.

6. Current Limitations and Future Prospects

Documented limitations include:

  1. Manual Reasoning Annotation: All reasoning summaries are currently curated by human annotators. The integration of RL-from-human-feedback (RLHF) and self-consistent chain-of-thought from the LLM community is proposed to enable end-to-end refinement.
  2. Reasoning Latency: Although reasoning is sparingly triggered, each invocation introduces a 1–4s pause. Asynchrony—running reasoning and acting in parallel—is a suggested future direction.
  3. VL Data Source Diversity: The training pipeline thus far leverages only a single, high-quality source of synthetic data. Incorporation of broader, potentially noisier datasets (e.g., CC12M, WebVid) and domain adaptation is posited as a route to further enhance visual generalization.

A plausible implication is that the unification of acting and reasoning in a single gated transformer, paired with scalable, embodied VL data generation and sophisticated co-training, represents a viable pathway toward robust, general-purpose robot intelligence that is capable of context-sensitive, long-horizon interactive behavior.
