
SNEI Dataset for Social Robot Navigation

Updated 20 December 2025
  • SNEI is a comprehensive, human-annotated vision-language dataset featuring 40,000 VQA pairs across five social reasoning levels.
  • It systematically maps visual perception through prediction, chain-of-thought reasoning, final actions, and explanations to benchmark advanced VLMs.
  • The dataset supports training socially aware robots in real-world public environments, enabling explainable and compliant navigation behaviors.

The Social robot Navigation via Explainable Interactions (SNEI) dataset is a comprehensive, human-annotated vision-language resource designed to enable and benchmark human-like reasoning for socially compliant navigation in robots. SNEI consists of 40,000 Visual Question–Answer (VQA) pairs, systematically labeled across five levels of social reasoning, and is drawn from real-world, unstructured, crowded public spaces. The dataset provides an explicit mapping from perception through prediction, reasoning, and action, supporting the training and evaluation of vision-language models (VLMs) for open-ended, socially aware robot behavior (Payandeh et al., 2024).

1. Scope and Rationale

The primary motivation for SNEI is to address the limitations of existing social navigation systems, which typically rely either on hand-crafted rules or direct imitation of human trajectories. These approaches fail to capture the chain of reasoning that underlies socially compliant actions in dynamic, human-populated environments. SNEI hypothesizes that if a robot can articulate what it perceives, its prediction of human actions, and the rationale for its decisions—all in human-interpretable language—its behavior will become more socially acceptable and transparent.

Data are sourced from the SCAND corpus, which records a mobile robot operating in genuinely unstructured public venues such as university corridors, cafeterias, museum passageways, and service lines. The interaction scenarios are selected to span diverse crowd densities and challenging social contexts where implicit norms (e.g., yielding, queuing, passing in bottlenecks) are critical for safe and acceptable robot conduct.

2. Data Collection and Annotation Methodology

The SNEI dataset is constructed from several hundred hours of first-person RGB video captured via a mobile robot equipped with a stereo forward-facing camera rig and odometry. From this corpus, 2,000 unique frames were manually selected to represent distinct human–robot interaction challenges, targeting balanced coverage across densities (from empty to dense) and encounter types (head-on, overtaking, side-passing, joining a queue).

Each selected image is annotated by three independent human annotators under strict guidelines. Annotation consists of generating five VQA pairs per image, each targeting a distinct reasoning category:

  1. Perception: Explicit scene description (colors, locations, poses)
  2. Prediction: Forecast of subsequent human behaviors
  3. Chain-of-Thought (CoT) Reasoning: Stepwise justification for the social action, integrating perception and prediction
  4. Final Action: High-level robot action recommendation (e.g., “Stop and wait”)
  5. Explanation: Narrative tying together scene, action, and rationale
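The five pairs attached to each image can be pictured as one structured record. The sketch below is illustrative only; the field names and example strings are assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SNEIAnnotation:
    """One image's five VQA pairs (field names are illustrative, not SNEI's schema)."""
    image_id: str
    perception: tuple[str, str]        # (question, answer): explicit scene description
    prediction: tuple[str, str]        # forecast of subsequent human behavior
    chain_of_thought: tuple[str, str]  # stepwise justification for the social action
    final_action: tuple[str, str]      # high-level recommendation, e.g. "Stop and wait"
    explanation: tuple[str, str]       # narrative tying scene, action, and rationale

# Hypothetical example record in the spirit of the annotations shown later.
ann = SNEIAnnotation(
    image_id="frame_0042",
    perception=("What obstacles and agents do you see?",
                "A red pillar is on my left; a person is approaching."),
    prediction=("What will the person do next?",
                "They will continue walking toward the robot."),
    chain_of_thought=("Why should you yield?",
                      "(1) Narrow corridor; (2) person approaching; (3) yield."),
    final_action=("What should you do?", "Stop and wait on the right side."),
    explanation=("Explain your choice.",
                 "The corridor is too narrow for both to pass at once."),
)
```

Grouping the five categories per image keeps the perception-to-explanation chain for a single scene retrievable as a unit.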

Disagreement among annotators is resolved by senior reviewers. To ensure reliability, inter-annotator agreement is measured using Cohen’s κ on a 5% subset, yielding κ ≈ 0.82 for perception, κ ≈ 0.78 for prediction, κ ≈ 0.74 for CoT, κ ≈ 0.85 for action, and κ ≈ 0.71 for explanation.
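Cohen's κ compares observed agreement between two raters against the agreement expected by chance from their marginal label frequencies. A minimal implementation of the standard formula (not project-specific code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

With κ ≈ 0.71–0.85 across categories, agreement falls in the range conventionally read as substantial to almost perfect.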

3. Dataset Structure and Statistical Properties

SNEI is split into training, development, and test sets in an 80/10/10 ratio, resulting in 32,000 training, 4,000 development, and 4,000 test VQA pairs. The five reasoning categories are equally represented, with 8,000 questions per category.
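Because each category contributes exactly 8,000 pairs, an 80/10/10 split applied per category preserves the balance in every partition. A sketch of such a stratified split (the record format is an assumption):

```python
import random

def stratified_split(pairs, seed=0):
    """Split VQA pairs 80/10/10 within each reasoning category."""
    by_cat = {}
    for p in pairs:
        by_cat.setdefault(p["category"], []).append(p)
    train, dev, test = [], [], []
    rng = random.Random(seed)
    for cat_pairs in by_cat.values():
        rng.shuffle(cat_pairs)
        n = len(cat_pairs)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        train += cat_pairs[:n_train]
        dev += cat_pairs[n_train:n_train + n_dev]
        test += cat_pairs[n_train + n_dev:]
    return train, dev, test
```

Splitting within each category, rather than globally, guarantees the 8,000-per-category balance carries through to all three partitions.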

Key dataset statistics:

  • Total unique images: 2,000
  • Total questions/answers: 40,000 each
  • Average question length: ~12 tokens (perception: ~8; CoT: ~18)
  • Average answer length: ~14 tokens (final actions), ~25 tokens (explanations)
  • Distinct word types: ~1,200
  • Average CoT reasoning steps: 3.4
  • Inter-annotator agreement: Cohen’s κ ≈ 0.71–0.85 by category (Section 2)

Category           Number of Pairs
Perception         8,000
Prediction         8,000
Chain-of-Thought   8,000
Final Action       8,000
Explanation        8,000

4. Example Annotation Scenarios

SNEI supplies richly contextualized language annotations:

  • Perception example:

Scene: Narrow hallway, red pillar left, person approaching. Q: “What obstacles and agents do you see, and where are they?” A: “A red pillar is on my left, about one meter ahead. A person wearing a blue shirt is walking toward the robot at the far end of the corridor.”

  • Chain-of-Thought/Action/Explanation example:

Q: “Given your perception and prediction, describe your reasoning and the action you should take.” CoT Steps: (1) Perceive narrow corridor and pillar; (2) Person is approaching; (3) Proceeding may cause collision; (4) Social norm advises yielding. Final Action: “Stop and wait on the right side until the person has passed.” Explanation: “Because the corridor is too narrow for both to pass simultaneously and the person is approaching, I stop and stand aside so they can go first, ensuring comfortable social interaction.”

Such structured multi-level annotation enables training and evaluation of VLMs not only for perception, but also for social reasoning, action selection, and explainability in robot navigation.

5. Evaluation Protocols and Benchmarking

Two key evaluation metrics are defined:

  • Overall Accuracy: For single-label answers,

\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]

  • Human-Judge Scoring: For open-ended/free-form responses,

\mathrm{Score} = \frac{1}{M} \sum_{j=1}^{M} s_j

where s_j ∈ {1, …, 5} and M = 15 human judges.
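Both metrics reduce to simple averages: exact-match accuracy over gold labels, and the mean of per-judge Likert ratings. A minimal sketch:

```python
def overall_accuracy(preds, golds):
    """Fraction of single-label answers that exactly match the ground truth."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def human_judge_score(scores):
    """Mean of per-judge ratings, each on a 1-5 Likert scale."""
    assert scores and all(1 <= s <= 5 for s in scores)
    return sum(scores) / len(scores)
```

The per-category numbers in the benchmark table below this protocol are averages of exactly this form over the held-out VQAs.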

Performance of Social-LLaVA (a VLM fine-tuned on SNEI), GPT-4V, and Gemini was compared by human judges on 50 held-out VQAs:

Category           GPT-4V   Gemini   Social-LLaVA
Perception         3.11     3.45     4.00
Prediction         3.18     3.87     4.06
Chain-of-Thought   3.41     3.79     4.08
Action             2.77     3.46     4.19
Explanation        3.16     3.66     3.95

Social-LLaVA, fine-tuned specifically on SNEI, consistently outperformed generalist VLMs, demonstrating the utility of the dataset for enabling human-like, socially calibrated robot actions.

6. Applications and Known Limitations

Use cases:

  • Fine-tuning VLMs for navigation in service, delivery, and guide robots in socially complex environments.
  • Benchmarking open-vocabulary spatial and social reasoning in VLMs across the full perception-prediction-action-explanation pipeline.
  • Developing explainable navigation agents capable of justifying their decisions to human users.

Limitations:

  1. SNEI uses single RGB frames; depth information, trajectory, and video sequences are not yet included. Incorporating temporally and spatially richer data would strengthen distance and motion reasoning capabilities.
  2. The dataset reflects social conventions primarily from the observed context; other cultural or regional norms may diverge. Gathering localized supplements or leveraging domain adaptation is advised.
  3. While 2,000 scenarios encompass many typical cases, rare or adversarial edge situations are underrepresented; ongoing annotation and active learning are recommended to broaden coverage.
  4. Despite high inter-annotator agreement, drift and subjective bias cannot be eliminated; continuous guideline calibration and adversarial validation are necessary.
  5. Final actions are rendered in abstract, high-level language—mapping these to low-level robot control commands is an open research challenge and requires auxiliary policies or mapping layers.
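The mapping layer mentioned in point 5 could, at its simplest, be a lookup from canonicalized action strings to velocity targets. The toy sketch below assumes a differential-drive base; all action names and numeric values are hypothetical illustrations, not part of the dataset:

```python
# Toy mapping from abstract, SNEI-style actions to (linear, angular) velocity
# targets for a differential-drive base. Action names and velocity values are
# hypothetical; a real system would need a learned or engineered policy layer.
ACTION_TO_TWIST = {
    "stop_and_wait":  (0.0, 0.0),
    "proceed_slowly": (0.3, 0.0),
    "yield_right":    (0.1, -0.4),
    "pass_on_left":   (0.4, 0.3),
}

def to_twist(action: str) -> tuple[float, float]:
    """Resolve an abstract action; default to stopping for unknown actions."""
    return ACTION_TO_TWIST.get(action, (0.0, 0.0))
```

Defaulting unknown actions to a full stop is a conservative fail-safe; in practice the free-form language of SNEI's final actions makes this grounding step a research problem in itself, as the limitation notes.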

7. Significance and Future Directions

SNEI represents the first large-scale, systematically annotated, human-curated VQA corpus dedicated to social robot navigation. It explicitly encodes the reasoning steps bridging visual perception and socially compliant action, with the added dimension of explainability through natural language. Its five-layer annotation scheme provides a granular dataset for evaluating and training VLMs in roles demanding fluid, human-aware interaction in unconstrained, populated spaces.

Future dataset expansions may include multimodal sensory integration (video, depth, audio), broader cultural representation, increased scenario diversity, and closed-loop validation in real-world robotic deployments. SNEI serves as both a benchmark and a resource for advancing the development of autonomous agents that must share space and norms with humans in public environments (Payandeh et al., 2024).
