
DROID: In-The-Wild Robot Manipulation Dataset

Updated 16 January 2026
  • The paper introduces DROID, a large-scale in-the-wild dataset that captures 76,000 teleoperated trajectories across diverse real-world scenes.
  • The dataset utilizes synchronized multi-view cameras, crowd-sourced language annotations, and standardized teleoperation protocols to cover 86 manipulation tasks.
  • Experimental results show that DROID-augmented policies significantly boost both in-distribution and out-of-distribution success compared to existing benchmarks.

DROID (Distributed Robot Interaction Dataset) is a large-scale, diverse, and high-quality in-the-wild robot manipulation dataset designed to address the limited scale and diversity of prior robot demonstration corpora. Collected over 12 months across North America, Asia, and Europe, DROID comprises 76 000 human-teleoperated demonstration trajectories spanning 350 hours of interaction, 564 unique real-world scenes, and 86 distinct manipulation tasks. The dataset offers broad coverage of environments, camera viewpoints, and object categories, and has been used as a benchmark for learning robust, generalizable robot manipulation policies. DROID is distributed under a permissive license and includes pre-trained policies and full hardware/software setup documentation (Khazatsky et al., 2024; Jiang et al., 2024).

1. Scope and Content of the Dataset

DROID’s design targets both breadth and realism in manipulation data. Its key characteristics are:

  • Scale and Diversity: 76 000 successful teleoperated trajectories (≈ 350 h interaction) recorded in 564 distinct indoor scenes across 52 buildings. Environments include home kitchens, laboratories, offices, living rooms, bedrooms, bathrooms, and hallways.
  • Task Distribution: 86 unique manipulation verbs (obtained by de-duplicating and normalizing the natural language instructions), capturing a long-tail task distribution from simple pick-and-place to complex multi-stage cooking and cleanup routines.
  • Objects: 125 object categories, spanning utensils, food packaging, tools, electronics, drawer handles, clothing, and more, supporting generalization to everyday scenarios.
  • Physical setup: 18 identical Franka Emika Panda 7-DoF robotic arms, each equipped with a Robotiq 2F-85 gripper, operated by 50 human teleoperators at 13 institutions.
  • Sensing modalities: Three synchronized stereo RGB cameras per scene (two table-mounted Zed 2, one wrist-mounted Zed Mini) with complete scene-specific extrinsic calibration. Robot state is logged at 15 Hz and includes joint positions/velocities, end-effector pose/velocity, and gripper opening.
  • Language annotations: Each trajectory is paired with 1–3 crowd-sourced natural language instructions, further processed for verb de-duplication and object label extraction.
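To make the list above concrete, here is a minimal sketch of how one trajectory record could be organized in memory. The class names, field names, and camera labels are illustrative assumptions and do not reflect the released file schema; only the 15 Hz rate, the 7-DoF arm, the three cameras, and the 1–3 instructions come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CONTROL_HZ = 15  # logging rate stated in the dataset description


@dataclass
class Step:
    """One logged timestep (hypothetical field names)."""
    joint_positions: Tuple[float, ...]   # 7 values for the 7-DoF arm
    joint_velocities: Tuple[float, ...]
    ee_pose: Tuple[float, ...]           # end-effector pose, layout assumed
    gripper_opening: float               # scalar gripper state, units assumed


@dataclass
class Trajectory:
    """One teleoperated demonstration with its metadata (hypothetical layout)."""
    scene_id: str
    collector_id: str
    instructions: List[str]              # 1-3 crowd-sourced instructions
    camera_names: Tuple[str, ...] = ("exterior_1", "exterior_2", "wrist")
    steps: List[Step] = field(default_factory=list)

    def duration_s(self) -> float:
        """Wall-clock duration implied by the step count and logging rate."""
        return len(self.steps) / CONTROL_HZ


# Toy example: 150 identical steps correspond to 10 seconds at 15 Hz.
zero_step = Step((0.0,) * 7, (0.0,) * 7, (0.0,) * 6, 0.0)
traj = Trajectory("scene_001", "collector_07", ["put the cup in the sink"],
                  steps=[zero_step] * 150)
print(traj.duration_s())
```

Keeping per-step state separate from per-trajectory metadata mirrors the split between the sensor streams and the scene/collector identifiers described in the text.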

2. Data Collection and Annotation Methodology

The data acquisition process involved a standardized, portable hardware/GUI pipeline:

  • Scene setup: A height-adjustable desk hosted the Franka arm, all cameras, control electronics, and a GUI laptop for interaction orchestration.
  • Calibration and Task Entry: For each scene, human collectors performed camera calibration (checkerboard for extrinsics) and specified task lists, either free-form or via prompt suggestions.
  • Teleoperation Protocol: The task for each episode was drawn by random sampling from the scene's task list. Teleoperation employed Meta Quest 2 controllers for direct 6-DoF end-effector and gripper control, keeping a human in the loop throughout to maintain safe operation.
  • Scene Augmentation: Periodic changes such as moving the robot base/cameras, altering lighting, or shuffling objects were employed to maximize environmental diversity.
  • Data logging: Robot/camera streams and metadata (collector ID, scene ID, timestamps) were stored in ROS bag format; success/failure was indicated post-episode.
  • Annotation pipeline: Natural language instructions were crowd-labeled (tasq.ai) and processed using spaCy and GPT-4 for linguistic normalization and object extraction. No frame-by-frame segmentation or exhaustive human object labeling was performed.
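The verb de-duplication step in the annotation pipeline can be illustrated with a toy stand-in. The sketch below does not use spaCy or GPT-4; it replaces them with two small hand-written lookup tables (`LEMMAS` and `SYNONYMS`, both hypothetical) and a first-word heuristic, purely to show what "normalizing instructions to a canonical verb" means in practice.

```python
from collections import Counter

# Hypothetical lookup tables standing in for a lemmatizer and a synonym merge.
LEMMAS = {"picks": "pick", "picked": "pick", "picking": "pick",
          "puts": "put", "placing": "place", "opened": "open", "opens": "open"}
SYNONYMS = {"put": "place", "grab": "pick"}


def normalize_verb(instruction: str) -> str:
    """Return a canonical verb, assuming the instruction starts with its verb."""
    words = instruction.strip().lower().split()
    if not words:
        return ""  # empty instruction: no verb to extract
    lemma = LEMMAS.get(words[0], words[0])
    return SYNONYMS.get(lemma, lemma)


instructions = [
    "Put the cup on the plate",
    "Placing the lid on the pot",
    "Grab the marker",
    "Open the drawer",
]
verb_counts = Counter(normalize_verb(text) for text in instructions)
print(verb_counts)
```

A real pipeline would need part-of-speech tagging to handle instructions that do not begin with the verb; the first-word heuristic here is only adequate for short imperative sentences.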

Filtered data for representation-centric work (Jiang et al., 2024) excluded trajectories shorter than 40 steps or lacking substantial language instructions, yielding a subset of 36 000 trajectories primarily for self-supervised pre-training.
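The filtering rule above can be written as a small predicate. This is a sketch under stated assumptions: the dictionary keys are hypothetical, and since the text does not define "substantial" language, a minimum word count (`min_words`) is used here as an assumed proxy.

```python
from typing import Dict, List

MIN_STEPS = 40  # length threshold stated in the text


def keep_trajectory(traj: Dict, min_words: int = 3) -> bool:
    """Keep trajectories with >= MIN_STEPS steps and one non-trivial instruction.

    min_words is an assumed stand-in for "substantial language instructions".
    """
    long_enough = traj["num_steps"] >= MIN_STEPS
    has_language = any(len(text.split()) >= min_words
                       for text in traj["instructions"])
    return long_enough and has_language


trajs: List[Dict] = [
    {"num_steps": 120, "instructions": ["put the cup in the sink"]},
    {"num_steps": 25, "instructions": ["open the top drawer"]},   # too short
    {"num_steps": 80, "instructions": [""]},                      # no language
]
kept = [t for t in trajs if keep_trajectory(t)]
print(len(kept))
```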

3. Dataset Statistics and Comparative Analysis

DROID distinguishes itself both in absolute scale and in its long-tail coverage of environments and tasks.

| Dataset | Trajectories | Unique Verbs | Scenes | Camera Calib. | Collection |
| --- | --- | --- | --- | --- | --- |
| RoboNet | 162 000 | n/a | 10 | Yes | scripted |
| RT-1 | 130 000 | 2 | 2 | Yes | human teleop |
| BridgeData V2 | 60 100 | 82 | 24 | Yes | 85% human/mixed |
| RH20T | 13 000 | 33 | 7 | Yes | human teleop |
| DROID | 76 000 | 86 | 564 | Yes | human teleop |

Major diversity and coverage metrics:

  • Verb distribution entropy: For verbs $\{v\}$,

$$H_{\rm verbs} = -\sum_{v} p_v \log p_v\,,\quad p_v = \frac{\#\,\text{traj with verb }v}{76{,}000}.$$

DROID: $H_{\rm verbs} \approx 4.2$ nats (vs. BridgeData V2: $\sim 3.8$), confirming greater task diversity.

  • Object and scene diversity: 125 object labels, 10 scene types, 564 unique scenes, and 1 417 camera viewpoints ($\ll 50$ in other datasets).
  • Interaction-location workspace coverage:

$$C = \frac{|\{\,\mathbf{x}_i\,:\,\mathbf{x}_i \in \mathcal{W},\,i=1\ldots N\}|}{\text{Vol}(\mathcal{W})}$$

where $\mathbf{x}_i$ is the location of the first gripper close in trajectory $i$. DROID: $C \sim 80\%$ of reachable workspace, significantly broader than table-only datasets ($\sim 30\%$).
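Both metrics above are simple to compute from per-trajectory metadata. The sketch below is an illustration rather than the authors' evaluation code. `verb_entropy` assumes each trajectory contributes exactly one normalized verb. `workspace_coverage` makes the coverage ratio concrete by voxelizing an axis-aligned box and reporting the fraction of voxels that contain at least one first-grasp location; this discretization, the box bounds, and the voxel size are all assumptions introduced here.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def verb_entropy(verbs: Iterable[str]) -> float:
    """Shannon entropy in nats of the empirical verb distribution."""
    counts = Counter(verbs)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def workspace_coverage(points: Iterable[Sequence[float]],
                       lower: Sequence[float],
                       upper: Sequence[float],
                       voxel: float = 0.1) -> float:
    """Fraction of voxels in the box [lower, upper) holding at least one point."""
    dims = [max(1, math.ceil((u - l) / voxel)) for l, u in zip(lower, upper)]
    occupied = set()
    for p in points:
        # Points outside the workspace box are ignored.
        if all(l <= x < u for x, l, u in zip(p, lower, upper)):
            index = tuple(min(int((x - l) / voxel), d - 1)
                          for x, l, d in zip(p, lower, dims))
            occupied.add(index)
    return len(occupied) / math.prod(dims)


# Toy example: four equally frequent verbs and a unit cube split into 8 voxels.
verbs = ["pick", "place", "open", "close"]
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9), (1.5, 0.0, 0.0)]
h = verb_entropy(verbs)
c = workspace_coverage(points, lower=(0, 0, 0), upper=(1, 1, 1), voxel=0.5)
print(h, c)
```

As a sanity check on the reported figure, a uniform distribution over 86 verbs would give the maximum possible entropy of $\ln 86 \approx 4.45$ nats, so a value near 4.2 nats indicates a fairly even spread across verbs.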

Summary statistics from (Jiang et al., 2024) include trajectory lengths (median ~120 frames, min = 40), a balanced open/close distribution for the gripper action, and object/action language diversity (top object labels: “cup”, “bottle”, “box”, “drawer”, “tool”; verbs: “pick”, “place”, “open”, “move”, “close”).

4. Policy Learning and Representation Application

DROID serves as both a supervised and self-supervised learning substrate:

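  • Policy learning architecture (Khazatsky et al., 2024): Inputs comprise dual 128×128 RGB streams (ResNet-50), a DistilBERT instruction encoding, and gripper state; these are fused by an MLP and passed to a diffusion U-Net head for multistep trajectory prediction. The training objective adopts the standard denoising diffusion loss over action sequence predictions, which in its usual form reads

$$\mathcal{L}(\theta) = \mathbb{E}_{k,\,\boldsymbol{\epsilon}}\left[\,\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{a}^{k}, k, \mathbf{o}) \rVert^2\,\right],$$

where $\mathbf{a}^{k}$ is the action sequence after $k$ noising steps, $\mathbf{o}$ is the fused observation and language conditioning, and $\boldsymbol{\epsilon}$ is the injected Gaussian noise.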

Training uses color jitter and random cropping for augmentation, together with an exponential moving average (EMA) of the weights; the pre-trained visual and language backbones are kept frozen.

  • Representation pre-training (MCR framework) (Jiang et al., 2024): Two-camera RGB streams plus proprioceptive state and delta actions provide a manipulation-centric representation. MCR employs a novel contrastive loss to align visual and proprioceptive/action dynamics, alongside a behavior cloning–like policy loss and a time contrastive loss, supporting effective downstream policy transfer.
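The denoising objective mentioned for the policy-learning architecture can be sketched in a few lines. This is a generic, dependency-free illustration of a DDPM-style noise-prediction loss on a flattened action chunk, not the authors' training code: the linear beta schedule, the number of diffusion steps, and the `predict_noise` callable (a stand-in for the U-Net head, here without any observation conditioning) are all assumptions.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def alpha_bar(k: int, num_steps: int) -> float:
    """Cumulative signal level after step k under an assumed linear beta schedule."""
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
    return math.prod(1.0 - b for b in betas[: k + 1])


def diffusion_loss(actions: Sequence[float],
                   predict_noise: Callable[[List[float], int], List[float]],
                   num_steps: int = 100,
                   rng: Optional[random.Random] = None) -> float:
    """Mean squared error between injected noise and the model's prediction."""
    if rng is None:
        rng = random.Random(0)
    k = rng.randrange(num_steps)                      # random diffusion step
    ab = alpha_bar(k, num_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in actions]    # Gaussian noise to inject
    noised = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e
              for a, e in zip(actions, noise)]
    predicted = predict_noise(noised, k)
    return sum((e - p) ** 2 for e, p in zip(noise, predicted)) / len(actions)


# Toy usage: an untrained "model" that always predicts zero noise.
actions = [0.1, -0.2, 0.05, 0.0]                      # flattened toy action chunk
zero_model = lambda noised, k: [0.0] * len(noised)
loss = diffusion_loss(actions, zero_model, rng=random.Random(0))
print(loss)
```

In a real policy the noise predictor would also receive the fused image, language, and proprioceptive features, and the loss would be averaged over a batch and minimized with gradient descent.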

5. Experimental Evaluation and Generalization

Policy training with DROID data demonstrates marked gains in both in-distribution (ID) and out-of-distribution (OOD) task performance. Evaluation spans six manipulation tasks (e.g., Close Waffle Maker, Place Chips on Plate, Cook Lentils) across four real environments. Success rates (% ± std):

| Method | In-distribution (%) | Out-of-distribution (%) |
| --- | --- | --- |
| No co-training | 52 ± 4 | 28 ± 5 |
| + Open-X (OXE) | 64 ± 3 | 46 ± 4 |
| + DROID | 76 ± 2 | 63 ± 3 |

DROID-augmented policies improve ID success by 12 percentage points (76% vs. 64%) and OOD success by 17 points (63% vs. 46%) relative to co-training with Open-X-Embodiment; relative to no co-training, the gains are 24 and 35 points. In OOD trials (novel distractors/objects/viewpoints), policies co-trained with DROID sustain roughly 63% success, with qualitative rollouts showing precise, smooth multi-step execution.
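The percentage-point differences can be read directly off the success-rate table. The short sketch below recomputes them from the mean values listed there; the dictionary layout and the `gain` helper are illustrative names introduced here.

```python
# Mean success rates (%) copied from the table above.
results = {
    "No co-training": {"ID": 52, "OOD": 28},
    "+ Open-X (OXE)": {"ID": 64, "OOD": 46},
    "+ DROID": {"ID": 76, "OOD": 63},
}


def gain(method: str, baseline: str, split: str) -> int:
    """Percentage-point difference in mean success between two methods."""
    return results[method][split] - results[baseline][split]


for baseline in ("+ Open-X (OXE)", "No co-training"):
    for split in ("ID", "OOD"):
        print(f"DROID vs {baseline} ({split}): "
              f"{gain('+ DROID', baseline, split):+d} points")
# DROID vs + Open-X (OXE) (ID): +12 points
# DROID vs + Open-X (OXE) (OOD): +17 points
# DROID vs No co-training (ID): +24 points
# DROID vs No co-training (OOD): +35 points
```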

Ablation studies manipulating scene diversity (contrasting the 20 most frequent scenes with roughly 200 random scenes at constant data volume) confirmed that broader scene variety, beyond dataset scale alone, directly improves OOD generalization (48% OOD success for the narrow subset vs. 58% for the diverse one).
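One way to construct the two training subsets for such an ablation is sketched below. The exact sampling procedure is not specified in the text, so this is an assumed design: both subsets draw the same number of trajectories (`budget`), one from the `k` most frequent scenes and one from `k` scenes chosen uniformly at random. The function names and dictionary keys are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Set


def _sample_from_scenes(trajs: List[Dict], scenes: Set[str],
                        budget: int, rng: random.Random) -> List[Dict]:
    """Draw up to `budget` trajectories whose scene is in `scenes`."""
    pool = [t for t in trajs if t["scene_id"] in scenes]
    return rng.sample(pool, min(budget, len(pool)))


def frequent_scene_subset(trajs: List[Dict], k: int, budget: int,
                          rng: random.Random) -> List[Dict]:
    """Low-diversity arm: sample only from the k most frequent scenes."""
    counts = Counter(t["scene_id"] for t in trajs)
    top = {scene for scene, _ in counts.most_common(k)}
    return _sample_from_scenes(trajs, top, budget, rng)


def random_scene_subset(trajs: List[Dict], k: int, budget: int,
                        rng: random.Random) -> List[Dict]:
    """High-diversity arm: sample from k scenes chosen uniformly at random."""
    scenes = sorted({t["scene_id"] for t in trajs})
    chosen = set(rng.sample(scenes, min(k, len(scenes))))
    return _sample_from_scenes(trajs, chosen, budget, rng)


# Toy data: scene i contributes i + 1 trajectories (55 in total).
trajs = [{"scene_id": f"scene_{i}", "traj_id": f"{i}_{j}"}
         for i in range(10) for j in range(i + 1)]
rng = random.Random(0)
narrow = frequent_scene_subset(trajs, k=2, budget=5, rng=rng)
broad = random_scene_subset(trajs, k=5, budget=5, rng=rng)
print(len(narrow), len(broad))
```

Holding `budget` fixed across both arms is what isolates scene diversity from data volume; if a scene pool holds fewer than `budget` trajectories, this sketch returns a smaller subset, so the sizes should be checked before comparing the two arms.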

6. Comparison to Existing Datasets

Relative to other large-scale manipulation datasets (e.g., RoboNet, RT-1, BridgeData V2, RH20T), DROID exhibits:

  • Order-of-magnitude increase in scene count (564 vs. 10–24).
  • Expanded verb and task repertoire (86 vs. 2–82), enhancing coverage of real-world manipulation behaviors.
  • All-human, in-the-wild data acquisition rather than scripted or hybrid collection protocols.
  • Fully calibrated multi-view camera geometry and broad action/state coverage within the robot’s manipulable workspace.
  • Inclusion of crowd-labeled natural language, increasing utility for instruction-conditioned learning.

These characteristics position DROID as the prevailing benchmark for learning from real-world, unstructured robot demonstrations at scale.

7. Limitations and Prospective Directions

DROID’s scale and realism are accompanied by several limitations:

  • Embodiment constraint: All data was collected with a single robot type (Franka Panda). A plausible implication is that generalization to novel manipulators or end-effectors may require additional data or transfer approaches.
  • Annotation granularity: Only 1–3 instructions per trajectory; richer, temporally grounded and multi-step annotations are absent.
  • Lack of depth sensing: The primary representation corpus (Jiang et al., 2024) utilizes only RGB data.
  • Data splits: Some works provide no explicit train/val/test splits; downstream users may need to define their own splitting strategy.

Ongoing and future research directions include unsupervised representation learning on DROID, simulation-to-real transfer leveraging scene diversity, zero-shot policy adaptation to new robots, richer annotation acquisition for planning, and active data curation to surface rare or challenging behaviors (Khazatsky et al., 2024; Jiang et al., 2024).
