DROID: In-The-Wild Robot Manipulation Dataset

Updated 16 January 2026

The paper introduces DROID, a large-scale in-the-wild dataset that captures 76,000 teleoperated trajectories across diverse real-world scenes.
The dataset uses synchronized multi-view cameras, crowd-sourced language annotations, and a standardized teleoperation protocol to cover 86 manipulation tasks.
Experiments show that policies co-trained with DROID achieve higher in-distribution and out-of-distribution success than policies trained without co-training or co-trained with Open-X-Embodiment data.

DROID (Distributed Robot Interaction Dataset) is a large-scale, diverse in-the-wild robot manipulation dataset designed to address the limited scale and diversity of prior robot demonstration corpora. Collected over 12 months across North America, Asia, and Europe, DROID comprises 76 000 human-teleoperated demonstration trajectories spanning 350 hours of interaction, 564 unique real-world scenes, and 86 distinct manipulation tasks. The dataset offers broad coverage of environments, camera viewpoints, and object categories, and serves as a benchmark for learning robust, generalizable robot manipulation policies. DROID is distributed under a permissive license and includes pre-trained policies and full hardware/software setup documentation (Khazatsky et al., 2024; Jiang et al., 2024).

1. Scope and Content of the Dataset

DROID’s design targets both breadth and realism in manipulation data. Its key characteristics are:

Scale and Diversity: 76 000 successful teleoperated trajectories (≈ 350 h interaction) recorded in 564 distinct indoor scenes across 52 buildings. Environments include home kitchens, laboratories, offices, living rooms, bedrooms, bathrooms, and hallways.
Task Distribution: 86 unique manipulation verbs (obtained by de-duplicating and normalizing the natural language instructions), capturing a long-tail task distribution from simple pick-and-place to complex multi-stage cooking and cleanup routines.
Objects: 125 object categories, spanning utensils, food packaging, tools, electronics, drawer handles, clothing, and more, supporting generalization to everyday scenarios.
Physical setup: 18 identical Franka Emika Panda 7-DoF robotic arms, each equipped with a Robotiq 2F-85 gripper, operated by 50 human teleoperators at 13 institutions.
Sensing modalities: Three synchronized stereo RGB cameras per scene (two Zed 2 table-mounted, one Zed Mini wrist-mounted) with complete scene-specific extrinsic calibration. Data includes high-frequency (15 Hz) robot joint positions/velocities, end-effector pose/velocity, and gripper opening.
Language annotations: Each trajectory is paired with 1–3 crowd-sourced natural language instructions, further processed for verb de-duplication and object label extraction.

2. Data Collection and Annotation Methodology

The data acquisition process involved a standardized, portable hardware/GUI pipeline:

Scene setup: A height-adjustable desk hosted the Franka arm, all cameras, control electronics, and a GUI laptop for interaction orchestration.
Calibration and Task Entry: For each scene, human collectors performed camera calibration (checkerboard for extrinsics) and specified task lists, either free-form or via prompt suggestions.
Teleoperation Protocol: Tasks were allocated per episode using random sampling. Teleoperation employed Meta Quest 2 controllers for direct 6-DoF end-effector and gripper control. A human operator remained in the loop throughout, which served as the safety mechanism.
Scene Augmentation: Periodic changes such as moving the robot base/cameras, altering lighting, or shuffling objects were employed to maximize environmental diversity.
Data logging: Robot/camera streams and metadata (collector ID, scene ID, timestamps) were stored in ROS bag format; success/failure was indicated post-episode.
Annotation pipeline: Natural language instructions were crowd-labeled (tasq.ai) and processed using spaCy and GPT-4 for linguistic normalization and object extraction. No frame-by-frame segmentation or exhaustive human object labeling was performed.

Filtered data for representation-centric work (Jiang et al., 2024) excluded trajectories shorter than 40 steps or lacking substantial language instructions, yielding a subset of 36 000 trajectories primarily for self-supervised pre-training.
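The filtering rule above can be illustrated with a minimal sketch. The `Trajectory` record, its field names, and the `min_words` threshold are hypothetical: the source only states a 40-step minimum and a requirement for "substantial" language instructions, without defining the latter.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical minimal record; field names are illustrative only."""
    traj_id: str
    num_steps: int
    instructions: list[str] = field(default_factory=list)


def keep_for_pretraining(traj, min_steps=40, min_words=2):
    """Return True if the trajectory passes both filters.

    min_steps mirrors the 40-step threshold in the text; min_words is an
    assumed stand-in for "substantial language instructions".
    """
    if traj.num_steps < min_steps:
        return False
    # Keep the trajectory if at least one instruction is long enough.
    return any(len(text.split()) >= min_words for text in traj.instructions)


trajectories = [
    Trajectory("a", 120, ["put the cup in the sink"]),
    Trajectory("b", 25, ["open the drawer"]),  # too short
    Trajectory("c", 90, []),                    # no instruction
    Trajectory("d", 40, ["close"]),             # instruction too terse
]
kept = [t.traj_id for t in trajectories if keep_for_pretraining(t)]
print(kept)  # → ['a']
```

Expressing the rule as a single predicate keeps both thresholds explicit, so a different definition of "substantial" can be swapped in without touching the rest of a loading pipeline.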

3. Dataset Statistics and Comparative Analysis

DROID distinguishes itself both in absolute scale and in its long-tail coverage of environments and tasks.

Dataset	Trajectories	Unique Verbs	Scenes	Camera Calib.	Collection
RoboNet	162 000	n/a	10	Yes	scripted
RT-1	130 000	2	2	Yes	human teleop
BridgeData V2	60 100	82	24	Yes	85% human/mixed
RH20T	13 000	33	7	Yes	human teleop
DROID	76 000	86	564	Yes	human teleop

Major diversity and coverage metrics:

Verb distribution entropy: For verbs $\{v\}$,

$H_{\rm verbs} = -\sum_{v} p_v \log p_v\,,\quad p_v = \frac{\#\,\text{traj with verb }v}{76{,}000}.$

DROID: $H_{\rm verbs} \approx 4.2$ nats (vs. BridgeData V2: $\sim 3.8$), confirming greater task diversity.

Object and scene diversity: 125 object labels, 10 scene types, 564 unique scenes, and 1 417 camera viewpoints ($< 50$ in other datasets).
Interaction-location workspace coverage:

$C = \frac{|\{\,\mathbf{x}_i\,:\,\mathbf{x}_i \in \mathcal{W},\,i=1\ldots N\}|}{\text{Vol}(\mathcal{W})}$

where $\mathbf{x}_i$ is the location of the first gripper close in trajectory $i$. DROID: $C \sim 80\%$ of reachable workspace, significantly broader than table-only datasets ($\sim 30\%$).
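Both metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' evaluation code: the entropy function assumes exactly one verb per trajectory, so that the $p_v$ sum to one, and the coverage function assumes the workspace $\mathcal{W}$ is an axis-aligned box discretized into a uniform grid, with coverage measured as the fraction of occupied cells. The grid resolution and the box bounds are hypothetical parameters.

```python
import math
from collections import Counter


def verb_entropy(verbs):
    """Shannon entropy (in nats) of the empirical verb distribution."""
    counts = Counter(verbs)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def workspace_coverage(points, lower, upper, cells_per_axis=10):
    """Fraction of grid cells inside the box [lower, upper) that hold a point.

    The uniform grid is an assumption of this sketch; points outside the
    box are ignored.
    """
    occupied = set()
    for p in points:
        if all(lo <= x < hi for x, lo, hi in zip(p, lower, upper)):
            cell = tuple(
                # Clamp to guard against floating-point rounding at the edge.
                min(int((x - lo) / (hi - lo) * cells_per_axis), cells_per_axis - 1)
                for x, lo, hi in zip(p, lower, upper)
            )
            occupied.add(cell)
    return len(occupied) / cells_per_axis ** len(lower)


# Toy data: one verb per trajectory, first gripper-close positions in metres.
verbs = ["pick", "pick", "place", "open"]
grasp_points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.2, 0.2, 0.2), (1.5, 0.0, 0.0)]

print(round(verb_entropy(verbs), 4))  # → 1.0397
print(workspace_coverage(grasp_points, (0, 0, 0), (1, 1, 1), cells_per_axis=2))  # → 0.25
```

For reference, a uniform distribution over 86 verbs has entropy $\ln 86 \approx 4.45$ nats, which is the upper bound for the verb entropy reported above.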

Summary statistics from Jiang et al. (2024) include trajectory lengths (median $\approx$ 120 frames, min = 40), a balanced open/close distribution for the gripper action, and object/action language diversity (top object labels: “cup”, “bottle”, “box”, “drawer”, “tool”; verbs: “pick”, “place”, “open”, “move”, “close”).

4. Policy Learning and Representation Application

DROID serves as both a supervised and self-supervised learning substrate:

Policy learning architecture (Khazatsky et al., 2024): Inputs comprise two 128×128 RGB streams (encoded with ResNet-50), a DistilBERT instruction encoding, and the gripper state; these are fused by an MLP and passed to a diffusion U-Net head for multistep trajectory prediction. The training objective adopts the standard denoising diffusion loss over action sequence predictions,

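$\mathcal{L}(\theta) = \mathbb{E}_{a_0,\,\epsilon,\,k}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_k}\,a_0 + \sqrt{1-\bar\alpha_k}\,\epsilon,\; k,\; o\right)\right\rVert^2\right],$

where, in the standard formulation, $a_0$ is the demonstrated action sequence, $\epsilon \sim \mathcal{N}(0, I)$ is the sampled noise, $k$ is the diffusion step with cumulative noise-schedule coefficient $\bar\alpha_k$, $o$ is the fused observation and language conditioning, and $\epsilon_\theta$ is the noise-prediction network.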

Training uses color jitter and random cropping for augmentation, together with an exponential moving average (EMA) of the model weights; the pre-trained visual and language backbones are kept frozen.

Representation pre-training (MCR framework) (Jiang et al., 2024): Two-camera RGB streams plus proprioceptive state and delta actions provide a manipulation-centric representation. MCR employs a novel contrastive loss to align visual and proprioceptive/action dynamics, alongside a behavior cloning–like policy loss and a time contrastive loss, supporting effective downstream policy transfer.
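The denoising diffusion objective mentioned for the policy architecture above can be sketched in plain Python. This is a generic noise-prediction loss, not the released training code: the simplified cosine schedule, the number of diffusion steps, the flattened action representation, and the `predict_noise` callable are all assumptions made for illustration.

```python
import math
import random


def cosine_alpha_bar(k, num_steps):
    """Simplified cosine schedule: fraction of signal variance left at step k."""
    return math.cos(0.5 * math.pi * k / num_steps) ** 2


def add_noise(actions, noise, alpha_bar):
    """Forward process: a_k = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * a + sigma * e for a, e in zip(actions, noise)]


def denoising_loss(predict_noise, actions, obs, num_steps=100, rng=random):
    """One-sample Monte Carlo estimate of the noise-prediction MSE.

    predict_noise(noisy_actions, k, obs) stands in for the policy network.
    """
    k = rng.randint(1, num_steps)                   # random diffusion step
    noise = [rng.gauss(0.0, 1.0) for _ in actions]  # eps ~ N(0, I)
    noisy = add_noise(actions, noise, cosine_alpha_bar(k, num_steps))
    pred = predict_noise(noisy, k, obs)
    return sum((p - e) ** 2 for p, e in zip(pred, noise)) / len(actions)


# Toy check on a flattened action chunk (2 steps x 2 dimensions).
rng = random.Random(0)
actions = [0.1, -0.2, 0.3, 0.0]


def zero_predictor(noisy, k, obs):
    """Trivial baseline that always predicts zero noise."""
    return [0.0] * len(noisy)


print(denoising_loss(zero_predictor, actions, obs=None, rng=rng))
```

A predictor that always outputs zero incurs a loss equal to the mean squared noise, while a predictor that recovers the injected noise exactly drives the loss to zero; training moves the network from the former toward the latter.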

5. Experimental Evaluation and Generalization

Policy training with DROID data demonstrates marked gains in both in-distribution (ID) and out-of-distribution (OOD) task performance. Evaluation spans six manipulation tasks (e.g., Close Waffle Maker, Place Chips on Plate, Cook Lentils) across four real environments. Success rates (% ± std):

Method	In-distribution	Out-of-distribution
No co-training	52 ± 4	28 ± 5
+ Open-X (OXE)	64 ± 3	46 ± 4
+ DROID	76 ± 2	63 ± 3

DROID-augmented policies improve ID success by 12 percentage points and OOD success by 17 points over co-training with Open-X-Embodiment (76 vs. 64 and 63 vs. 46), and by 24 and 35 points over training without co-training. In OOD trials (novel distractors, objects, or viewpoints), policies co-trained with DROID sustain roughly 63% success, with qualitative rollouts showing precise, smooth multi-step execution.
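The percentage-point gains can be read directly off the success-rate table. The short sketch below recomputes them from the mean values only (standard deviations are ignored); the dictionary layout and the `gain` helper are illustrative names.

```python
# Mean success rates (%) copied from the table above.
success = {
    "No co-training": {"ID": 52, "OOD": 28},
    "+ Open-X (OXE)": {"ID": 64, "OOD": 46},
    "+ DROID": {"ID": 76, "OOD": 63},
}


def gain(method, baseline, split):
    """Percentage-point difference in mean success between two methods."""
    return success[method][split] - success[baseline][split]


for baseline in ("No co-training", "+ Open-X (OXE)"):
    for split in ("ID", "OOD"):
        delta = gain("+ DROID", baseline, split)
        print(f"+ DROID vs {baseline} [{split}]: {delta:+d} points")
# → + DROID vs No co-training [ID]: +24 points
# → + DROID vs No co-training [OOD]: +35 points
# → + DROID vs + Open-X (OXE) [ID]: +12 points
# → + DROID vs + Open-X (OXE) [OOD]: +17 points
```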

Ablation studies varying scene diversity at constant data volume (the 20 most frequent scenes versus roughly 200 random scenes) confirmed that broader scene variety, beyond dataset scale alone, directly improves OOD generalization (48% vs. 58% OOD success).

6. Comparison to Existing Datasets

Relative to other large-scale manipulation datasets (e.g., RoboNet, RT-1, BridgeData V2, RH20T), DROID exhibits:

An increase of more than an order of magnitude in scene count (564 vs. 2–24).
Expanded verb and task repertoire (86 vs. 2–82), enhancing coverage of real-world manipulation behaviors.
All-human, in-the-wild data acquisition rather than scripted or hybrid collection protocols.
Fully calibrated multi-view camera geometry and broad action/state coverage within the robot’s manipulable workspace.
Inclusion of crowd-labeled natural language, increasing utility for instruction-conditioned learning.

These characteristics position DROID as the prevailing benchmark for learning from real-world, unstructured robot demonstrations at scale.

7. Limitations and Prospective Directions

DROID’s scale and realism are accompanied by several limitations:

Embodiment constraint: All data was collected with a single robot type (Franka Panda). A plausible implication is that generalization to novel manipulators or end-effectors may require additional data or transfer approaches.
Annotation granularity: Only 1–3 instructions per trajectory; richer, temporally grounded and multi-step annotations are absent.
Lack of depth sensing: The primary representation corpus (Jiang et al., 2024) utilizes only RGB data.
Data splits: Some works provide no explicit train/val/test splits, so downstream users may need to define their own splitting strategies.

Ongoing and future research directions include unsupervised representation learning on DROID, simulation-to-real transfer leveraging scene diversity, zero-shot policy adaptation to new robots, richer annotation acquisition for planning, and active data curation to surface rare or challenging behaviors (Khazatsky et al., 2024; Jiang et al., 2024).


References (2)

Khazatsky et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.

Jiang et al. (2024). Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets.
