RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization
Abstract: Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B-parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations across diverse task families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow matching, and distillation for real-time inference. As a result, RDT2 is among the first models to generalize zero-shot to unseen objects, scenes, instructions, and even robotic platforms simultaneously. It also outperforms state-of-the-art baselines on dexterous, long-horizon, and dynamic downstream tasks such as playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.
Explain it Like I'm 14
What is this paper about?
This paper is about building a “generalist” robot brain that can understand words, see the world, and move its arms—all at the same time—and then work on brand‑new robots it has never seen before. The model is called RDT2. It aims to do open‑ended tasks (like “pick up the blue cup and put it on the shelf”) without extra training, even when the robot’s hardware is different.
What questions did the researchers ask?
They focused on three big questions, in simple terms:
- How can we collect a huge amount of good, cheap, real‑world robot teaching data, not just from labs but from many homes and situations?
- How can we train a single model that reads instructions (language), looks at camera images (vision), and controls the robot’s movements (actions) efficiently and fast enough for real time?
- Can this model handle brand‑new things at once—new objects, new rooms, new instructions, and even a new robot body—without extra training? This is called “zero‑shot cross‑embodiment generalization.”
How did they do it?
They combined smarter data collection with a three‑stage training process that connects language understanding to precise robot control.
Collecting lots of data with a portable “teaching handle” (UMI)
Instead of teleoperating expensive lab robots, they used a handheld device called UMI (Universal Manipulation Interface). Imagine a sturdy game controller with a camera and motion tracker. A human uses it to demonstrate how to move and grasp objects. Because UMI is “embodiment‑agnostic” (it doesn’t depend on a specific robot arm), the same demonstrations can be used to train robots with different hardware.
They redesigned UMI to make it stronger and more accurate:
- Used tougher materials and infrared tracking for precise 3D motion, even in cluttered or shiny environments.
- Used a compact “linkage gripper” to get into tight spaces.
With about 100 of these devices placed in 100+ homes, they collected over 10,000 hours of real‑life demonstrations—one of the largest open‑source robot datasets of this kind.
Training the model in three stages
Here’s an analogy to make each stage clear:
- Stage 1: Turn smooth motions into “action words” and teach the language‑vision backbone. Think of a dance that’s continuous and fluid. They first convert this dance into a short list of named moves (“tokens”) using a method called Residual Vector Quantization (RVQ). This makes actions look more like words, which LLMs are great at learning. They then train a large vision‑LLM (a 7‑billion‑parameter model) to predict these action tokens from images and instructions. Why this matters: It’s fast to learn and keeps the model’s language knowledge intact.
- Stage 2: Add a small “action expert” to bring back smooth, precise motion. Action tokens are handy, but real robot movements are continuous. So they freeze the big backbone and train a smaller “action expert” (about 400 million parameters) to turn the backbone’s understanding into smooth, continuous movements. They use “diffusion with flow matching,” which is like starting from noisy, messy motions and learning the fastest way to clean them up into the right action. This gives accurate, real‑valued control, guided by what the big model sees and understands.
- Stage 3: Make it ultra fast for real‑time tasks. Diffusion usually takes several steps, which can be too slow for fast jobs like hitting a ping‑pong ball. So they “distill” the multi‑step process into a single quick step, like learning a shortcut after lots of practice. This keeps quality while making the robot react very quickly.
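The core math behind the three stages can be sketched in a few lines. This is a toy illustration only: the sizes (action dim C, codebook size K, depth m), the random codebooks, and the zero-code trick are illustrative assumptions, not the paper's implementation, which trains the codebooks and networks end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (toy): Residual Vector Quantization of a continuous action.
C, K, m = 4, 8, 3  # action dim, codebook size, RVQ depth (illustrative)
codebooks = []
for _ in range(m):
    cb = rng.normal(size=(K, C))
    cb[0] = 0.0  # keep a zero code so quantizing never enlarges the residual
    codebooks.append(cb)

def rvq_encode(action):
    """At each depth, pick the code nearest the current residual, subtract it."""
    residual = action.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

def rvq_decode(tokens):
    """Reconstruct by summing the selected codes across depths."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

# Stage 2 (toy): flow matching. Interpolate between noise and the action;
# the regression target is the constant velocity v* = action - noise.
def flow_matching_pair(action):
    t = rng.uniform()
    noise = rng.normal(size=action.shape)
    x_t = (1.0 - t) * noise + t * action
    return x_t, t, action - noise

# Stage 3 context: the multi-step Euler sampler that the distilled
# one-step generator learns to match in a single forward pass.
def euler_sample(v_fn, noise, steps=5):
    x, dt = noise.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * v_fn(x, k * dt)
    return x
```

Note the exact identity `rvq_decode(tokens) + residual == action`: each extra depth shrinks the leftover residual, which is why a short token list can describe a fluid motion closely enough for the LLM to learn.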
What did they find, and why is it important?
Here are the main results, described plainly:
- Zero‑shot generalization on four fronts at once: Without extra training, RDT2 could handle new objects, new rooms, new instructions, and even new robot arms. That’s rare. It shows the model learned skills that transfer across different real‑world setups.
- Big, reliable dataset helps a lot: With 10,000+ hours from many homes, the model saw far more variety than typical lab data. This variety made it better at generalizing.
- Scaling laws: more data + bigger models = steady improvement. As they increased the amount of data and model size, performance improved in a predictable way. This is like saying “study more and have a bigger notebook, and your test scores steadily go up.” It suggests that continuing to scale data and model size should keep boosting robot smarts.
- Strong performance on tough tasks after light fine‑tuning: On tricky, real‑world tasks—folding clothes, clearing a table step‑by‑step, and even playing table tennis—RDT2 beat other state‑of‑the‑art methods. It was especially good with deformable objects (like cloth) and dynamic actions (like fast button presses or hitting a moving ball).
- Much faster control thanks to distillation: The final “one‑step” version reacts faster than other models, even some that are smaller. That’s important for safety and for tasks where timing matters.
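The “predictable improvement” in the scaling-law result is usually formalized as a power law, loss ≈ a·D^(−b), which appears as a straight line on log-log axes. A minimal sketch of fitting such a curve; the hours and loss values below are made up for illustration, not the paper's measurements:

```python
import numpy as np

# Synthetic losses following an exact power law L(D) = 2.0 * D^(-0.25).
# A real fit would use measured training losses at each data budget.
hours = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])
loss = 2.0 * hours ** -0.25

# A power law is linear in log-log space, so a degree-1 polyfit recovers it.
slope, intercept = np.polyfit(np.log(hours), np.log(loss), 1)
exponent, prefactor = slope, np.exp(intercept)  # recovers -0.25 and 2.0
```

The fitted exponent is what lets researchers extrapolate: it predicts how much extra data would be needed to cut the loss by a given factor.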
What does this mean for the future?
In simple terms, this research moves us closer to helpful, general‑purpose robots that can:
- Learn from lots of everyday demonstrations collected cheaply, outside of labs.
- Understand open‑ended instructions and act in new places with new tools.
- Be moved to different robot bodies without starting over.
This could lower costs, speed up progress, and make robots more useful in homes, hospitals, and factories. The authors also point out important safety and privacy responsibilities: data from homes must protect people’s identities, and robots must have safety checks before working around humans.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper that future researchers could concretely address.
- Cross-embodiment without hardware homogeneity: The approach requires the same camera model and a “physically consistent” gripper for transfer. It remains unclear how to achieve zero-shot transfer across heterogeneous cameras (different intrinsics, FOVs, extrinsics), grippers (suction, soft hands, parallel-jaw variants), and end-effectors without hardware standardization or manual calibration.
- Action space expressivity: The policy operates on 6-DoF end-effector pose plus gripper width; it does not address torque/impedance control, force feedback, whole-arm coordination, or multi-contact control needed for contact-rich and compliant manipulation.
- Temporal modeling: The formulation explicitly assumes the current RGB observation o_t contains all necessary information and does not use historical observations. The impact of memory (e.g., recurrent/transformer state, episodic memory) on long-horizon and partially observable tasks is untested.
- Zero-shot on complex tasks: Zero-shot evaluation is limited to simple open-vocabulary tasks (pick/place/wipe/press/shake). Whether the method zero-shots to deformable-object manipulation, multi-step plans, or highly dynamic tasks (beyond fine-tuning) is not demonstrated.
- Language–action alignment in demonstrations: The paper does not specify how natural language instructions are paired with UMI demonstrations at scale (annotation pipeline, coverage, quality control, inter-rater reliability). Reproducible methods for collecting high-quality action-caption pairs are missing.
- Scaling laws tied to real-world control: The scaling analysis uses training loss as the proxy for generalization but does not connect tokens/parameters to task success rates, closed-loop stability, or safety in deployment. A scaling law for control metrics (e.g., success vs. data/compute) is needed.
- Tokenization effects on downstream performance: RVQ is evaluated via discretization error vs. token budget, but the causal impact of different tokenizers (RVQ vs. FAST vs. binning) on actual manipulation success and generalization is not quantified.
- Sensitivity to chunk size and tokenization depth: The choice of action chunk size T_a, RVQ depth m, codebook size K, and CNN encoder/decoder architectures is not systematically ablated for performance–latency trade-offs.
- Distillation stability and robustness: The one-step generator’s stability under closed-loop control, noise perturbations, varying step sizes, or OOD inputs is not analyzed. Sensitivity to teacher sampling schedule, noise distributions, and teacher–student mismatch remains open.
- Latency–control-loop interactions: Inference frequency is reported, but the end-to-end control loop latency (perception, planning, actuation) and its effect on task performance—especially in high-speed dynamics—are not characterized.
- Safety and failure analysis: There is no quantitative treatment of safety guardrails, runtime checks (collision, force limits), or systematic failure mode analysis (e.g., mislocalized objects, instruction misinterpretation, gripper misalignment) in zero-shot deployment.
- Data diversity quantification: Although data spans 100+ households, the dataset’s diversity (object taxonomies, scene layouts, lighting, demographics, socioeconomic/geographic variation) is not quantified. Bias assessment and stratified performance reporting are missing.
- Generalization across sensors and modalities: Input is RGB-only; integration of depth, inertial, tactile, force/torque, or proprioception—and their contribution to generalization, precision, and safety—is unexplored.
- Automatic cross-embodiment calibration: The practical steps (and algorithms) for automatic camera/gripper calibration, extrinsic alignment, and coordinate frame mapping required for zero-shot deployment are not provided or evaluated.
- Joint training vs. staged freezing: Stage 2 freezes the VLM backbone. Whether joint or iterative unfreezing improves language–action grounding, compositionality, and control fidelity is not assessed.
- Architecture ablations in Stage 2: The action expert uses GQA and cross-attends to every backbone layer, but alternatives (e.g., selective layer conditioning, adapter layers, normalizing flows, direct regression with uncertainty) are not compared.
- Evaluation breadth and reproducibility: Zero-shot tests are in controlled lab scenes with new objects; household-in-the-wild zero-shot evaluations, standardized benchmarks (e.g., DROID, RH20T, RoboArena protocols), and public task scripts for replicability are missing.
- Fairness in baseline comparisons: Fine-tuning comparisons to π0-FAST and π0.5 lack detail on equalization of data, compute, and training schedules. A controlled, standardized benchmarking protocol is needed.
- Instruction generalization robustness: Beyond de-duplication, robustness to paraphrases, compositional language (multi-object, temporal constraints), and ambiguous/underspecified instructions is not tested.
- Multi-arm and mobile manipulation: Claims focus on bimanual manipulation but do not explore zero-shot transfer to mobile manipulation (base control), locomotion, or coordination across more than two manipulators.
- Contact-rich/deformable/fluids zero-shot: While fine-tuning shows gains on deformables and dynamics, zero-shot performance on contact-rich tasks (e.g., insertion, tool use, fluids) is unreported.
- Closed-loop performance under perception errors: The effect of camera miscalibration, occlusions, motion blur, or tracker inaccuracies on policy performance and recovery behavior is not quantified.
- Energy/compute cost and efficiency: Training/inference compute, energy use, and carbon footprint are not analyzed; strategies for scaling with constrained resources (model compression, sparsity, mixed precision) are absent.
- Dataset licensing and privacy details: The paper mentions ethics but does not detail consent procedures, anonymization pipelines, licensing, or mechanisms for contributor data deletion and governance.
- Failure recovery and replanning: The policy’s ability to detect failures and replan (e.g., missed grasp, wrong placement) is not examined; integration with high-level planners or corrective loops is an open area.
- Instruction-to-action time alignment: How language timing (when instructions are issued) aligns with action chunks and whether policies handle delayed or streaming instructions is unclear.
- Real-world robustness metrics: Confidence intervals are reported for some tasks, but broader statistical robustness (e.g., bootstrap CIs, significance tests across variants, stratified results by environment) is limited.
- Theoretical grounding of embodiment-agnostic mapping: A formal analysis of the invariances (e.g., SE(3) alignment, gripper kinematics equivalence) that enable cross-embodiment transfer with UMI is missing; failure conditions and limits of this assumption are not characterized.
Practical Applications
Overview
Below is a structured analysis of practical, real-world applications derived from the RDT2 paper’s findings, methods, and innovations. The items are organized into Immediate Applications (deployable now) and Long-Term Applications (requiring further research, scaling, or development). Each entry highlights sector relevance, potential products/workflows, and key assumptions or dependencies that may impact feasibility.
Immediate Applications
- Cross-embodiment deployment of generalist manipulation policies (Robotics, Manufacturing, Warehousing)
- Use case: Roll out a single policy across different robot arms for open-vocabulary tasks (pick, place, wipe, press, shake) without per-robot fine-tuning.
- Tools/workflows: Install a standardized camera and linkage gripper; use RDT2-VQ/RDT2-FM/RDT2-UltraFast; adopt “4U” zero-shot evaluation protocol for commissioning.
- Assumptions/dependencies: Hardware adherence to the same camera/gripper models and calibration; RGB-only sensing suffices; basic safety guardrails; sufficient GPU/edge compute for 7B inference; task distribution close to training data.
- Rapid task commissioning in mixed fleets (Robotics Integration, Systems Engineering)
- Use case: Integrators can quickly enable new workcells (e.g., kitting/packing lines) by providing language instructions and minimal environment setup.
- Tools/workflows: Stage-3 distilled “UltraFast” policy for real-time control; standardized policy deployment pipeline with ROS; quick validation via repeated-trial success-rate harness.
- Assumptions/dependencies: Clear camera views; deterministic gripper actuation; cycle-time requirements align with RDT2-UltraFast throughput.
- Domestic service tasks in controlled home pilots (Daily Life, Assistive Robotics)
- Use case: Table bussing, laundry folding, fridge item retrieval, basic cleanup in controlled trials or assisted scenarios.
- Tools/workflows: Household pilot with the redesigned UMI hardware for fast teaching and policy adjustments; voice or text instructions; fenced safety zones.
- Assumptions/dependencies: Privacy-compliant data and deployment practices; human supervision; standardized gripper; consistent lighting.
- Assistive task automation in care settings (Healthcare, Elderly Care)
- Use case: Fetching items, pressing call buttons, wiping surfaces, opening containers under caregiver supervision.
- Tools/workflows: Predefined, safety-checked task lists; natural language prompts for caregivers; UltraFast variant for time-sensitive actions.
- Assumptions/dependencies: Medical safety protocols; risk assessments; restricted scope of tasks; regulatory compliance and facility approval.
- Robotics education and lab reproducibility (Academia, Education)
- Use case: Courses and labs can replicate “4U” zero-shot experiments, scaling-law analysis, and ablation studies using open-source datasets and training recipes.
- Tools/workflows: RDT2 training pipeline (Stage 1 RVQ + Stage 2 flow-matching + Stage 3 distillation); standardized UMI hardware; evaluation harness with repeated trials.
- Assumptions/dependencies: Access to GPUs; procurement or fabrication of the redesigned UMI; institutional safety procedures.
- Accelerated training for VLA models using RVQ + flow-matching (Software/AI Tooling)
- Use case: Teams training other VLA models can adopt the paper’s hybrid recipe for faster convergence and real-time inference.
- Tools/workflows: RVQ tokenizer for continuous action discretization; flow-matching loss for action expert; on-the-fly diffusion distillation to single-step generators.
- Assumptions/dependencies: Compatibility with existing VLM backbones; correct hyperparameter selection; code availability and licensing.
- Benchmarking and QA in robot policy evaluation (Robotics QA, Standards)
- Use case: Apply the 256–1000-trial repeated-test protocol with standard errors to validate generalization across unseen objects/scenes/instructions/embodiments.
- Tools/workflows: “4U” test methodology; deduplication of instructions vs. training corpus; reproducible scenes and object sets.
- Assumptions/dependencies: Access to unseen object sets; controlled test scenes; standardized reporting practices.
- Privacy and safety best practices in in-home data collection (Policy)
- Use case: Immediate adoption of privacy/anonymization protocols, consent forms, and deployment guardrails for household data collection and pilot deployments.
- Tools/workflows: Data handling SOPs; safety interlocks; incident reporting; participant communication templates.
- Assumptions/dependencies: Institutional review or ethics board oversight; compliance with local data protection laws; clear liability boundaries.
- Crowd-sourced manipulation data services using enhanced UMI (Data Operations, Startups)
- Use case: Operate a distributed network of redesigned UMIs to collect high-fidelity, in-the-wild data to expand open-source robot datasets.
- Tools/workflows: Device logistics and calibration; participant onboarding; automated data ingest/cleaning pipelines; contribution incentives.
- Assumptions/dependencies: Sustainable economics; participant recruitment; robust anonymization; continuous device QA.
Long-Term Applications
- General-purpose home robots with robust zero-shot capabilities (Consumer Robotics)
- Use case: Broad household assistance across diverse appliances and environments without per-home fine-tuning.
- Tools/workflows: Standardized embodiment-agnostic hardware spec (camera/gripper); rich voice interfaces; recovery strategies for failures.
- Assumptions/dependencies: Industry-wide hardware standardization; stronger safety certification; improved perception for clutter/occlusions; reliability at scale.
- Flexible small-batch manufacturing with language-instructed robots (Manufacturing)
- Use case: Rapidly reconfigured assembly, cable routing, fabric handling, and tool operation via natural language, minimal retraining.
- Tools/workflows: Policy packs for task families; integration with MES/ERP for instructions; multi-robot coordination; quality assurance loops.
- Assumptions/dependencies: Tactile/force sensing integration for delicate operations; robust high-speed dynamics handling; formal verification of long-horizon tasks.
- Logistics robots that adapt on-the-fly to new items and bins (Warehousing, Retail)
- Use case: Open-vocabulary picking and placement for changing SKUs and ad-hoc bin layouts.
- Tools/workflows: Continuous inventory-to-language mapping; online perception updates; task-level monitoring for error recovery.
- Assumptions/dependencies: Better object recognition for transparent/deformable items; improved gripper diversity; multimodal sensing (RGB-D, tactile).
- Healthcare-grade assistive robots with regulatory approval (Healthcare)
- Use case: Expanded scope of assistive tasks in hospitals/eldercare facilities (e.g., opening medication packets, handling soft materials).
- Tools/workflows: Clinical-grade safety systems; standardized task catalogs; audit trails and incident analyses.
- Assumptions/dependencies: Regulatory pathways (FDA/CE equivalents) for VLA-based autonomy; formal risk models; robust human-in-the-loop protocols.
- Embodiment-agnostic interface standards (UMI 2.0) across vendors (Standards, Policy, Industry Consortia)
- Use case: A cross-vendor specification for camera/gripper calibration, tracking, and policy portability enabling true plug-and-play manipulation.
- Tools/workflows: Reference designs; calibration toolkits; certification programs; interoperability test suites.
- Assumptions/dependencies: Governance via standards bodies; vendor cooperation; IP/licensing frameworks; security-by-design considerations.
- Global remote education and research networks using UMI (Academia, Education)
- Use case: MOOCs and distributed labs where students collect local manipulation data and contribute to shared datasets; replicate RDT2-scale studies.
- Tools/workflows: Low-cost UMI kits; cloud training/inference; collaborative benchmarking platforms.
- Assumptions/dependencies: Affordable hardware distribution; equitable cloud access; unified data schemas; privacy and consent tooling at scale.
- Commercial marketplaces for task “policy packs” (Software, Robotics Ecosystem)
- Use case: Distribution and maintenance of curated, safety-vetted policies for specific task families (e.g., “laundry,” “kitchen cleanup,” “packing”).
- Tools/workflows: Versioning and compatibility checks across embodiments; automated regression tests; usage analytics and updates.
- Assumptions/dependencies: Liability frameworks; customer support; patching for edge cases; secure policy distribution.
- Safety certification frameworks for VLA robots (Policy, Regulatory)
- Use case: A standardized rubric to certify zero-shot policies using “4U” tests, repeated-trial statistics, and guardrail conformance.
- Tools/workflows: Public testbeds; documented risk scores; standardized incident reporting; third-party audits.
- Assumptions/dependencies: Regulatory acceptance; consensus on metrics; funding for certification infrastructure.
- Middleware for cross-embodiment control (Software/Robotics)
- Use case: ROS packages or SDKs that transparently map UMI-trained policies to heterogeneous robot arms and grippers.
- Tools/workflows: Auto-calibration, time-sync, and spatial alignment modules; device drivers; monitoring dashboards.
- Assumptions/dependencies: Broad driver support; robust calibration under wear/tear; standardized APIs.
- Energy-efficient autonomy via distilled single-step policies (Energy, Edge AI)
- Use case: Lower compute and power draw in mobile robots by replacing multi-step diffusion with distilled one-step generators.
- Tools/workflows: Edge deployment toolchains; performance-power profiling; adaptive model switching based on task dynamics.
- Assumptions/dependencies: Continued advances in on-device accelerators; robust distillation across more tasks; thermal management.
Glossary
- 6-DoF: Six degrees of freedom describing 3D position and orientation of a rigid body. Example: "UMI records the 6-DoF end-effector pose"
- Ablation studies: Experimental analyses that remove or alter components to assess their contribution. Example: "extensive ablation studies demonstrated the effectiveness of the adopted training strategy and design choices."
- Action chunk: A short sequence of consecutive actions treated as a unit for prediction or generation. Example: "an action chunk A_t := (a_t, ..., a_{t+T_a})"
- Action expert: A specialized policy module that generates actions, often conditioned on features from a backbone model. Example: "we freeze the pretrained VLA backbone from Stage 1 and train a dedicated action expert."
- Autoregressive inference: Sequential prediction where each token/action is generated conditioned on previously generated ones. Example: "the inefficiency of autoregressive inference."
- Bimanual manipulation: Robotic control involving two arms/hands operating simultaneously. Example: "We consider the bimanual manipulation task"
- Binning (uniform binning): Discretizing continuous values into uniformly spaced bins. Example: "The uniform binning (Brohan et al., 2022; Zitkovich et al., 2023) achieved the lowest error"
- Chunk size: The number of time steps included in an action chunk. Example: "Ta is the chunk size (Zhao et al., 2023)"
- Codebook: A set of discrete vectors used to quantize continuous representations in vector quantization. Example: "e_j ∈ R^{K×C} is the learnable codebook of size K at depth j."
- Codebook collapse: Failure mode where only a small subset of codebook entries are used during vector quantization. Example: "To mitigate the notorious codebook collapse, we have taken several measures during RVQ training"
- Compositional generalization: Ability to generalize to novel combinations of known elements (e.g., objects, scenes, instructions, embodiments). Example: "Achieving this compositional generalization"
- Cosine similarity: A similarity metric based on the cosine of the angle between vectors. Example: "replacing the Euclidean distance with cosine similarity in Eq. (1)"
- Cross-attention: Attention mechanism that conditions one sequence on another (e.g., actions on vision-language features). Example: "leverages cross-attention to incorporate the latent features from each layer of the VLA backbone."
- Cross-embodiment deployment: Applying a learned policy to robots with different physical embodiments. Example: "VLA models confront a significant limitation for cross-embodiment deployment."
- Cross-entropy loss: Standard classification loss measuring divergence between predicted and true discrete distributions. Example: "train the VLM by minimizing the cross-entropy loss."
- Deformable objects: Objects that change shape under force, complicating manipulation. Example: "complex manipulation tasks involving deformable objects and fluids"
- Denoising network: Model that maps noisy inputs to cleaner estimates (e.g., in diffusion or flow matching). Example: "v_θ(·) is the denoising network with trainable parameters θ"
- Diffusion distillation: Converting a multi-step diffusion policy into a faster, fewer-step (or one-step) generator via distillation. Example: "we employ diffusion distillation (Salimans & Ho, 2022; Chen et al., 2023) to convert the expert policy trained in Stage 2 into a single-step generator."
- Diffusion models: Generative models that learn to reverse a noising process to sample from complex distributions. Example: "Alternative methods using diffusion models"
- Distillation (loss): Training a compact model to match the outputs of a larger teacher model, often via regression. Example: "we propose a simple yet effective distillation loss and distill the action expert into a single-step generator"
- Distributional gap: Mismatch between training and deployment data distributions that harms generalization. Example: "creating a distributional gap between the training data and real-world applications."
- Embodiment-agnostic: Designed to function independently of robot-specific hardware details. Example: "provides an embodiment-agnostic, handheld device"
- End-effector: The robot’s tool or gripper at the end of a manipulator arm that interacts with the environment. Example: "UMI records the 6-DoF end-effector pose"
- Exponential moving average (EMA): Smoothing technique for parameter updates that weights recent values more. Example: "smoothing codebook updates via exponential moving average (EMA) (Razavi et al., 2019)"
- Fine-tuning: Additional training on a target task/dataset to adapt a pretrained model. Example: "fine-tuning experiments"
- Flow matching: Training framework that learns continuous-time velocity fields to transform noise into data. Example: "The action expert is supervised by a flow-matching loss"
- Grouped Query Attention (GQA): Attention variant that groups queries to reduce compute/memory while maintaining performance. Example: "substituting Multi-Head Attention (MHA) (Vaswani et al., 2017) with Grouped Query Attention (GQA) (Ainslie et al., 2023)."
- Imitation learning: Learning policies from demonstrations rather than explicit reward signals. Example: "language-conditioned imitation learning for VLA models"
- In-the-wild data collection: Gathering data in unstructured real-world environments rather than controlled labs. Example: "a portable framework facilitating scalable, in-the-wild data collection."
- Integration steps: Discrete steps used to numerically integrate or iterate the generative process (e.g., in diffusion/flow). Example: "we set the step size τ = 0.2, corresponding to 5 integration steps."
- Latent space: Learned feature space where different modalities are projected for joint modeling. Example: "We project various modalities to a unified latent space"
- Linkage gripper: A gripper mechanism using linkages to transmit motion for compact, dexterous grasping. Example: "Linkage Gripper"
- Long-horizon tasks: Tasks requiring many sequential steps with temporal dependencies. Example: "long-horizon, and dynamic downstream tasks like playing table tennis."
- Multi-Head Attention (MHA): Transformer attention mechanism with multiple parallel attention heads. Example: "substituting Multi-Head Attention (MHA) (Vaswani et al., 2017) with Grouped Query Attention (GQA)"
- Multimodality: Presence of multiple valid behaviors/actions for the same context, requiring distributional modeling. Example: "the inherent multimodality of human-collected demonstrations"
- Next-token prediction objective: Training objective where the model predicts the next token in a sequence. Example: "using a next-token prediction objective."
- Open-vocabulary tasks: Tasks specified by free-form language without a fixed, predefined set of labels. Example: "open-vocabulary tasks"
- Residual Vector Quantization (RVQ): Hierarchical vector quantization that encodes residuals across multiple codebooks for efficient discretization. Example: "encode the continuous robot actions into discrete tokens with Residual Vector Quantization (RVQ)"
- Scaling laws: Empirical relationships describing how performance scales with model size, data, and compute. Example: "Fig. 5 shows the scaling law curves of RDT2"
- Sim-to-real gap: Performance drop when transferring policies from simulation to real-world due to modeling mismatches. Example: "plagued by a significant sim-to-real gap"
- Single-step generator: A distilled generator that produces actions from noise in one forward pass. Example: "convert the expert policy trained in Stage 2 into a single-step generator."
- System identification: Estimating physical parameters/models of a system from data for accurate control. Example: "traditional control methods due to the difficulty of physical modeling and system identification"
- Teleoperation: Controlling a robot remotely by a human operator to collect demonstrations. Example: "Traditional data collection through teleoperation (Zhao et al., 2023; Fu et al., 2024) is often prohibitively expensive"
- Universal Manipulation Interface (UMI): Handheld, embodiment-agnostic device for collecting robot manipulation demonstrations. Example: "The Universal Manipulation Interface (UMI) (Chi et al., 2024) provides an embodiment-agnostic, handheld device"
- Vision-Language-Action (VLA) models: Models integrating visual input, language instructions, and action outputs for robot control. Example: "Vision-Language-Action (VLA) models represent a promising paradigm for achieving generalized embodied intelligence"
- Vision-LLM (VLM): Models jointly trained on images and text to learn aligned representations and reasoning. Example: "built upon a 7B parameter VLM"
- Zero-shot (generalization/deployment): Applying a model to new tasks or embodiments without task-specific fine-tuning. Example: "zero-shot deployment on novel embodiments"
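Two of the quantization-related entries above, cosine-similarity code lookup and EMA codebook smoothing, fit in a few lines. A minimal sketch; the vector shapes and the decay value are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def nearest_code_cosine(codebook, x):
    """Return the index of the code with the largest cosine similarity to x."""
    sims = codebook @ x / (
        np.linalg.norm(codebook, axis=1) * np.linalg.norm(x) + 1e-8
    )
    return int(np.argmax(sims))

def ema_update(code, assigned_mean, decay=0.99):
    """Nudge a codebook entry toward the mean of the vectors assigned to it,
    weighting the existing value more heavily (EMA smoothing)."""
    return decay * code + (1.0 - decay) * assigned_mean
```

Cosine similarity makes the lookup scale-invariant (only direction matters), while the EMA update keeps codebook entries from jumping with each batch, one of the measures commonly used against codebook collapse.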