SCAND: Socially CompliAnt Navigation Dataset
- SCAND is a large-scale multi-modal dataset for socially compliant robot navigation, integrating human teleoperation demonstrations with detailed sensor logs and social context annotations.
- The dataset includes rich sensor data from 3D LiDAR, RGB/depth cameras, IMU, and more, collected in diverse indoor and outdoor environments to support robust model training.
- SCAND offers annotated action commands, scenario tags, and benchmark metrics that facilitate the evaluation of navigation safety, social compliance, and decision ambiguity.
The Socially CompliAnt Navigation Dataset (SCAND) is a large-scale, multi-modal corpus designed to support learning, evaluation, and benchmarking of socially compliant robot navigation policies. SCAND contains human teleoperation demonstrations on real robots in rich human-populated indoor and outdoor settings, with detailed sensor logs, ground-truth actions, and annotated social interaction contexts. Multiple SCAND variants exist, each tailored to a different navigation research paradigm: the canonical SCAND release comprises 138 human-driven trajectories (8.7 hours, ∼40 km) on two robotic platforms, while more recent variants emphasize action ambiguity and human social norms through dialogue and ranked decision sets.
1. Dataset Origins and Motivations
SCAND was introduced to address the critical lack of large-scale, first-person demonstration data for the development of social robot navigation, particularly for imitation learning and inverse reinforcement learning methods (Karnan et al., 2022, Raj et al., 2023). Robotic navigation in human environments demands not only collision avoidance but also contextually compliant behaviors, such as yielding, overtaking, and group navigation, which are challenging to encode via analytic reward functions. Human teleoperation was selected as the most data-efficient method for capturing authentic social interactions, making SCAND a foundational resource for policy learning that reflects actual human norms and navigation strategies.
2. Data Collection: Platforms, Modalities, and Scenarios
SCAND’s canonical release comprises demonstrations performed by four human operators using the Boston Dynamics Spot quadruped and Clearpath Jackal wheeled robot. Data were collected in real-world conditions on the University of Texas at Austin campus, spanning a variety of indoor (corridors, lobbies) and outdoor (sidewalks, intersections, stadium crowds) environments (Karnan et al., 2022, Raj et al., 2023). Each robot was equipped with a tightly synchronized suite of sensors:
- 3D LiDAR: Velodyne VLP-16 at 10 Hz, yielding high-density point clouds (x, y, z, intensity).
- RGB and depth cameras: Azure Kinect RGB (1280×720, 20 Hz), stereo vision, and multiple monocular streams.
- Odometry: Jackal wheel odometry (30 Hz); Spot visual odometry and kinematics (20 Hz).
- IMU: 6-DOF inertial units (16–100 Hz).
- Joystick commands: Operator-issued (v, ω) at 10 Hz.
- Synchronization: All sensing streams timestamped on a single ROS clock, with static transform chains recorded for cross-modal data fusion.
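Because all streams share a single ROS clock, cross-modal samples can be paired by nearest timestamp. A minimal sketch of such alignment (a hypothetical helper for illustration, not part of the SCAND tooling):

```python
import bisect

def nearest_sync(ts_a, ts_b):
    """For each timestamp in ts_a, return the index of the closest
    timestamp in ts_b (both sorted, seconds on a common clock)."""
    pairs = []
    for t in ts_a:
        i = bisect.bisect_left(ts_b, t)
        # Compare the neighbor on each side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(ts_b)]
        j = min(candidates, key=lambda k: abs(ts_b[k] - t))
        pairs.append(j)
    return pairs

# Pair 10 Hz LiDAR stamps with 20 Hz camera stamps.
lidar_ts  = [0.00, 0.10, 0.20]
camera_ts = [0.00, 0.05, 0.10, 0.15, 0.20]
print(nearest_sync(lidar_ts, camera_ts))  # [0, 2, 4]
```

In practice the recorded static transform chains would then map each paired measurement into a common frame.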
Each trajectory is further annotated with high-level “social tags”—such as “With Traffic,” “Against Traffic,” “Intersection,” “Narrow Doorway,” “Street Crossing,” and “Passing Conversational Groups”—summarizing the social context encountered. The dataset encompasses a diverse array of human-robot interaction scenarios:
| Scenario Type | Fraction of Data |
|---|---|
| Frontal approach | 22 % |
| Intersection | 18 % |
| Following | 16 % |
| Overtaking/passing | 15 % |
| Narrow doorway | 12 % |
| Group formations | 9 % |
| Other (waiting, turns) | 8 % |
Each demonstration captures both a global navigation plan (sequence of waypoints) and the local action stream (teleoperator joystick commands), providing a complete record of socially contextualized navigation behavior (Raj et al., 2023).
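Under a standard unicycle model, the recorded (v, ω) joystick stream fully determines the robot's planar motion, so a demonstrated trajectory can be reconstructed by dead reckoning. A hedged sketch (illustrative, not SCAND tooling; dt matches the 10 Hz joystick rate):

```python
import math

def integrate_unicycle(commands, dt=0.1, pose=(0.0, 0.0, 0.0)):
    """Integrate a sequence of (v, omega) commands, dt seconds per
    step, into a trace of (x, y, theta) poses."""
    x, y, theta = pose
    trace = [(x, y, theta)]
    for v, omega in commands:
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += omega * dt
        trace.append((x, y, theta))
    return trace

# One second of driving straight at 1 m/s ends ~1 m ahead.
trace = integrate_unicycle([(1.0, 0.0)] * 10)
```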
3. Dataset Structure, Formats, and Access
SCAND adopts a consistent per-trajectory file structure:
```
/SCAND/
    /jackal/ or /spot/
        /trajectory_XXX/
            trajectory_XXX.bag    # Raw ROS bag
            imu.csv               # IMU time series
            lidar.pcd             # Sample LiDAR cloud
            images_front/*.png    # Kinect RGB frames
            images_stereo/        # Jackal stereo pairs
            joystick.csv          # (v, ω) actions
            odom.csv              # Odometry
            social_tags.json      # Social context tags
            transforms.yaml       # Static transforms
```
- ROS bags store all streams for extensible replay and extraction.
- CSV exports are provided for core modalities (odometry, IMU, joystick).
- Images are consistently indexed for vision models.
- Metadata (tags, transforms) supports downstream sensor fusion and semantic analyses.
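Given this layout, loading the sidecar files for one trajectory reduces to a few standard-library reads. A minimal sketch (the CSV column names here are assumptions; check the release's actual headers):

```python
import csv
import json
from pathlib import Path

def load_trajectory(root):
    """Load the CSV/JSON sidecar files for one SCAND trajectory.
    Column names 'v' and 'omega' are illustrative placeholders."""
    root = Path(root)
    with open(root / "joystick.csv", newline="") as f:
        actions = [(float(r["v"]), float(r["omega"]))
                   for r in csv.DictReader(f)]
    with open(root / "social_tags.json") as f:
        tags = json.load(f)
    return {"actions": actions, "tags": tags}
```

The raw ROS bag remains the authoritative record; the CSV/JSON exports simply avoid a ROS dependency for common analyses.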
Data splits for benchmarking include 100 trajectories for training, with 19 each for validation and test; an additional out-of-distribution (OOD) test set comprises 50 expert-curated “minigame” scenarios focused on challenging social contexts (Raj et al., 2023).
4. Annotation Protocols, Social Contexts, and Action Representation
SCAND leverages task-level and scenario-level annotation strategies:
- Scenario tags: Each trajectory and timestamp is labeled with a fixed set of scenario types reflecting canonical social navigation interactions (e.g., intersection crossing, following, overtaking).
- Action encoding: All operator commands are recorded as (v, ω) pairs for continuous planning; discrete high-level primitives (e.g., Move Forward, Turn Left/Right, Stop) are specified in derivative datasets (Wang et al., 25 Dec 2025).
- Ambiguity and consensus: Recent SCAND-based datasets explicitly model real-world action ambiguity, with each sample paired with dual human-annotated ranked sets of feasible, socially preferred actions. Annotation involves hierarchical consensus protocols that prune infeasible actions, apply social norm filters (e.g., interpersonal distance preservation), and rank by motion efficiency (Wang et al., 25 Dec 2025).
The multi-modal, ambiguity-aware SCAND variant employs a three-turn language interaction (scene description, motion prediction, ranked action request) coupled with a single RGB scene frame. Six discrete commands define the permissible action space, supporting both classification and instruction-following paradigms.
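The hierarchical consensus protocol can be viewed as a filter-then-sort pipeline over the discrete action space. A hedged sketch (the predicates and efficiency scores below are placeholders, not the authors' exact criteria):

```python
def rank_actions(candidates, feasible, socially_ok, efficiency):
    """Prune infeasible actions, apply a social-norm filter, then
    rank the survivors by motion efficiency (descending).

    candidates  : list of discrete action names
    feasible    : action -> bool (e.g., no collision)
    socially_ok : action -> bool (e.g., preserves interpersonal distance)
    efficiency  : action -> float (progress toward the goal)
    """
    kept = [a for a in candidates if feasible(a) and socially_ok(a)]
    return sorted(kept, key=efficiency, reverse=True)

actions = ["Move Forward", "Turn Left", "Turn Right", "Stop"]
ranked = rank_actions(
    actions,
    feasible=lambda a: a != "Turn Right",       # blocked by a wall
    socially_ok=lambda a: a != "Move Forward",  # pedestrian directly ahead
    efficiency={"Move Forward": 1.0, "Turn Left": 0.6,
                "Turn Right": 0.4, "Stop": 0.0}.get,
)
print(ranked)  # ['Turn Left', 'Stop']
```

In the dataset itself, two trained raters produce such ranked sets independently, with disagreements adjudicated by the consensus protocol.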
5. Benchmarking, Baseline Models, and Evaluation Metrics
SCAND serves as the backbone for imitation learning (IL), behavior cloning (BC), and hybrid geometric-learning planning benchmarks:
- Behavior Cloning Baseline: Inputs comprise BEV LiDAR images (CNN encoder), recent odometry/IMU states (MLP encoder), and prior global plans. Models predict both local joystick action sequences and future waypoints via MSE loss (Karnan et al., 2022).
- Hybrid Planning: A simple gating classifier switches between classical geometric planners (“move_base”) and BC, exploiting the empirical result that geometric planners are socially compliant in the majority of cases, while learned models excel at the harder scenarios (Raj et al., 2023).
- Evaluation Metrics: Metrics systematically quantify trajectory-level and action-level social compliance.
  - Global-plan compliance: Hausdorff distance between planned and demonstrated waypoint sequences.
  - Local-plan compliance: L2 distance between predicted and demonstrated action sequences over a fixed-length horizon.
  - Compliance ratio: the fraction of demonstrations on which a planner's output is judged socially compliant, i.e., sufficiently close to the human demonstration.
  - Other metrics: social distance, comfort violations, efficiency, and action-agreement statistics (Pred@1, APG, MAA, Error Rate) in ambiguity-aware settings (Wang et al., 25 Dec 2025).
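The global-plan metric and compliance ratio can be made concrete as follows; this is a sketch, and the 0.5 m threshold is purely illustrative, not a value specified by the SCAND papers:

```python
import math

def hausdorff(path_a, path_b):
    """Undirected Hausdorff distance between two 2D waypoint lists."""
    def directed(p, q):
        return max(min(math.dist(a, b) for b in q) for a in p)
    return max(directed(path_a, path_b), directed(path_b, path_a))

def compliance_ratio(planned, demonstrated, threshold=0.5):
    """Fraction of trajectories whose planned waypoints stay within
    `threshold` meters (Hausdorff) of the human demonstration."""
    hits = sum(hausdorff(p, d) <= threshold
               for p, d in zip(planned, demonstrated))
    return hits / len(planned)
```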
Policy performance on SCAND is quantified via compliance ratios and human-rated perceptions of “social compliance” and “safety” in controlled field trials (Karnan et al., 2022, Raj et al., 2023).
6. Variants Emphasizing Social Ambiguity and Dialogue
A recent variant of SCAND (Wang et al., 25 Dec 2025) introduces explicit multi-action annotation and language-based scene interpretation. Each sample provides:
- A single RGB image (512×512, no depth),
- Three-turn user–assistant dialogue describing the social context, inferring human intent, and requesting ranked navigation actions,
- Ranked sets of 1–6 discrete action primitives, dual-annotated by trained raters and adjudicated by social-norm-oriented consensus,
- Five custom metrics—Pred@1, Pred@n, All-Pred-in-GT (APG), Multi-action Accuracy (MAA), and Error Rate (ER)—to capture decision ambiguity, precision, and safety.
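Under plausible readings of these metrics, they reduce to set operations over the predicted ranking and the ground-truth set. The definitions below are assumptions for illustration (Pred@1: top-ranked prediction lies in the ground truth; APG: every prediction does; ER: fraction of predictions that do not), not the paper's formal definitions:

```python
def pred_at_1(pred, gt):
    """Is the model's top-ranked action in the ground-truth set?"""
    return pred[0] in gt

def all_pred_in_gt(pred, gt):
    """APG: does every predicted action appear in the ground truth?"""
    return all(a in gt for a in pred)

def error_rate(pred, gt):
    """ER: fraction of predicted actions outside the ground-truth set."""
    return sum(a not in gt for a in pred) / len(pred)

pred = ["Turn Left", "Move Forward"]   # model's ranked output
gt = {"Turn Left", "Stop"}             # annotated consensus set
print(pred_at_1(pred, gt), all_pred_in_gt(pred, gt), error_rate(pred, gt))
# True False 0.5
```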
This configuration prioritizes study of real-world social uncertainty, context sensitivity across indoor/outdoor and crowd density conditions, and fast evaluation of reasoning over ambiguous social situations.
7. Limitations and Prospects for Extension
Key limitations of SCAND stem from its single collection environment (one university campus, with its particular cultural norms), the lack of frame-level human detection/tracking annotations, and its focus on the navigation task rather than full-scene semantic perception (Karnan et al., 2022, Raj et al., 2023). The dataset was collected primarily in daytime, fair-weather conditions, limiting exposure to adverse operating extremes. Explicit proxemic measurements and rich human intention/gaze annotations are absent but proposed as future directions. SCAND’s highly structured sensor streams and scenario taxonomy, however, support ongoing extension to multi-city, cross-cultural, multi-modal, and sim-to-real transfer research.
SCAND remains a foundational benchmark for socially compliant navigation, enabling robust comparative evaluation of geometric, learning-based, and hybrid planning systems under authentic and diverse human–robot interaction scenarios.