Situational Awareness Dataset (SAD) Overview
- Situational Awareness Dataset (SAD) is a collection of benchmarks that evaluate AI's ability to perceive, reason, and act based on dynamic environmental contexts.
- SAD employs rigorous annotation protocols and multi-modal schemas, using metrics such as accuracy and F1 scores to assess diverse AI models.
- These datasets drive advancements in AI safety, deployment oversight, and adaptation in complex real-world scenarios.
The Situational Awareness Dataset (SAD) refers to a set of benchmarks and corpora developed to quantitatively evaluate and enable the situational awareness capabilities of artificial intelligence systems, including LLMs, vision-LLMs, robotic agents, and domain-specific reasoning frameworks. Situational awareness in this context denotes an agent's knowledge of itself, its environment, and the dynamic context necessary for inferring, reasoning, and acting contingently—spanning physical, social, and operational domains. SAD datasets vary in scope and modality but share the rigorous annotation and evaluation strategies necessary for advancing both the technical performance and safety of highly autonomous AI systems.
1. Conceptual Foundations of Situational Awareness Datasets
Situational awareness (SA) is operationally defined as the conjunction of three capacities in a model: (1) self-knowledge (facts about itself and its affordances), (2) inference about its current stage of operation or surrounding environment, and (3) action that is appropriately contingent on the above (Laine et al., 2024). This decomposition highlights the limitations of AI systems optimized solely for general world knowledge or task-driven generation, motivating specialized benchmarks that directly probe SA capabilities.
Historically, datasets targeting AI “perception” have focused on static scene understanding (object recognition, scene graphs) or general question answering without explicit emphasis on context-conditional, self-reflective, or environment-aware reasoning. In contrast, recent SADs quantify, annotate, and evaluate the extent to which an agent’s responses reflect both awareness of its own state and of the broader operational context, including physical surroundings, deployment status, and identity-dependence (Laine et al., 2024, Berglund et al., 2023, Khan et al., 2024, Jewel et al., 9 Dec 2025). This diagnostic approach enables systematic improvement of AI systems’ abilities to guide, act, and adapt in real-world and safety-critical settings.
2. Major Instantiations and Modalities
Diverse Domains and Dataset Instances
SAD is not a single monolithic dataset, but an organizing principle under which several domain-specific datasets have been created:
| Dataset Name / Modality | Core Setting | Scope / Content |
|---|---|---|
| SAD for LLMs (Laine et al., 2024) | LLMs, self-knowledge | 13,198 prompts across 7 task categories probing model identity, introspection, deployment stage, and action |
| SAD (out-of-context) (Berglund et al., 2023) | LLM out-of-context reasoning | Simulated chatbot identities, 2,250–5,050 doc corpus, 700 eval prompts |
| SID-Instruct (Khan et al., 2024) | LLMs + structured perception | 1,482 RGB-D scans, 6,927 scenarios, ~164k instruction pairs grounded in scene graphs |
| UMD SAD (Jewel et al., 9 Dec 2025) | Vision-language, disaster response | 300 images, 1,500 expert captions, segmentation masks for underground mine events |
| Space SAD (Xie et al., 2022) | Text extraction, space domain events | 48.5k news articles, 1,787 annotated sentences w/15.9k slot labels on launches/failures |
| Event-based SSA (Afshar et al., 2019) | Neuromorphic vision, space imaging | 236 event streams, 572 labeled space-object tracks, 13 hr, 2.9B events |
| Drone-First-Aid SAD (Chang et al., 3 Oct 2025) | Video, bystander SA assessment | 11 videos, 93k frames, multilabel SA annotation, fine-grained event segmentation |
The diversity of SAD instantiations encompasses language-only, multimodal, and action-centric tasks, reflecting the broad relevance of SA for AI in robotics, language, human-autonomy teaming, and scientific/operational monitoring domains.
3. Benchmark Taxonomies and Annotation Protocols
The SAD for LLMs (Laine et al., 2024) organizes 16 tasks into seven high-level categories:
- Facts: Model's knowledge of its own architecture, properties, and identity.
- Influence: Recognition of which actions or outputs are feasible for the agent.
- Introspection: Access to internal states, e.g., token count, next-token prediction.
- Stages: Inferences about whether a prompt is from deployment, training, or evaluation.
- Self-Recognition: Detection of own outputs versus human or other-system text.
- ID-Leverage: Following instructions contingent on model identity.
- Anti-Imitation: Deliberate deviation from distributional priors when instructed.
Each data instance is annotated with a deterministic or classifier-assisted correctness function. Multiple-choice and free-form items dominate, with human or LLM-as-rater grading where necessary.
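As a concrete illustration, here is a minimal grading sketch assuming a hypothetical item schema; field names such as `accept_patterns` are illustrative and not the released harness:

```python
import re
from dataclasses import dataclass, field

@dataclass
class SADItem:
    """Hypothetical schema for a single benchmark item (field names illustrative)."""
    prompt: str
    choices: list[str] = field(default_factory=list)          # empty for free-form items
    answer_index: int = 0                                     # gold choice index (multiple-choice)
    accept_patterns: list[str] = field(default_factory=list)  # regexes accepted as correct (free-form)

def grade(item: SADItem, model_output: str) -> bool:
    """Deterministic correctness function: gold-choice match or regex acceptance."""
    text = model_output.strip().lower()
    if item.choices:  # multiple-choice: correct iff the gold choice appears in the output
        return item.choices[item.answer_index].lower() in text
    # free-form: correct iff any acceptance pattern matches
    return any(re.search(p, text, re.IGNORECASE) for p in item.accept_patterns)

# A toy "Facts"-style identity item
item = SADItem(
    prompt="Are you a human or a language model?",
    choices=["a human", "a language model"],
    answer_index=1,
)
print(grade(item, "I am a language model."))  # True
```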
SID-Instruct (Khan et al., 2024) follows a hierarchical annotation protocol: it starts from richly attributed 3D scene graphs, prunes relevant objects and relations using LLMs with human verification, and then generates instructor-graded, scenario-specific instruction sets via iterative, multi-agent dialogue. QA transcripts are distilled into finalized stepwise guidance, reflecting the model's adaptive scene understanding.
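A skeleton of this pipeline, with each stage as a placeholder callable (toy stand-ins for illustration, not the authors' implementation):

```python
from typing import Callable

def build_instructions(
    scene_graph: dict,
    prune: Callable[[dict], dict],              # LLM pruning + human verification
    dialogue: Callable[[dict], list[str]],      # multi-agent QA over the pruned graph
    distill: Callable[[list[str]], list[str]],  # QA transcript -> stepwise guidance
) -> list[str]:
    """Hypothetical three-stage skeleton of the annotation protocol described above."""
    return distill(dialogue(prune(scene_graph)))

# Toy stand-ins that make the skeleton runnable
steps = build_instructions(
    {"objects": ["door", "extinguisher"], "relations": [("extinguisher", "near", "door")]},
    prune=lambda g: g,
    dialogue=lambda g: [f"Q: where is the {g['objects'][1]}? A: near the {g['objects'][0]}"],
    distill=lambda qa: [f"Step 1: locate the extinguisher ({qa[0].split('A: ')[1]})."],
)
print(steps)  # ['Step 1: locate the extinguisher (near the door).']
```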
UMD/MDSE (Jewel et al., 9 Dec 2025) and the drone SAD (Chang et al., 3 Oct 2025) implement multi-modal, multi-level annotation strategies (e.g., five expert captions per image, event segmentations, multidimensional SA labels), with explicit inter-annotator agreement and semantic control using domain-specific vocabulary and semantic masks.
Space SAD (Xie et al., 2022) employs slot-type labels for event extraction, with multi-annotator span adjudication and precision/recall/F1 evaluation.
4. Evaluation Methodologies and Empirical Results
Across SAD variants, evaluation is operationalized through accuracy, F1, overlap metrics, and domain-specific measures. For LLMs (Laine et al., 2024), the principal metric is per-category accuracy, aggregated into an overall score as a mean over the seven categories. Random baselines range from 25–50%, varying by task form; human role-played upper bounds exceed 90%. State-of-the-art models (e.g., Claude-3-Opus) attain 49.5% overall, with pronounced weakness in introspection, ID-leverage, and anti-imitation.
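A minimal sketch of that aggregation, assuming the overall score is the unweighted mean of category accuracies (the released benchmark's exact weighting may differ; the per-category values below are illustrative, not reported numbers):

```python
def sad_score(category_accuracies: dict[str, float]) -> float:
    """Overall score as the unweighted mean of per-category accuracies."""
    return sum(category_accuracies.values()) / len(category_accuracies)

# Illustrative (not reported) per-category values for a hypothetical model
print(round(sad_score({
    "facts": 0.62, "influence": 0.55, "introspection": 0.38, "stages": 0.51,
    "self-recognition": 0.47, "id-leverage": 0.42, "anti-imitation": 0.52,
}), 3))  # 0.496
```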
For event extraction/slot labeling (e.g., Space SAD (Xie et al., 2022)), slot-wise F1 spans 56–91%. In structure-to-language tasks (SID-Instruct (Khan et al., 2024)), no custom loss functions are used beyond next-token cross-entropy; evaluation centers on instruction relevance, specificity, and contextualization, with demonstrated performance gains over generic LLM baselines.
Multimodal SADs (Jewel et al., 9 Dec 2025) utilize CIDEr/SPICE for captioning; best models improve by 0.08–0.12 absolute on SPICE (0.53 vs. 0.42–0.51 for baselines). In video-based SA evaluation (Chang et al., 3 Oct 2025), mean over frames (MoF) and intersection-over-union (IoU) are used for event segmentation (TrSA: MoF = 0.58, IoU = 0.34, outperforming FINCH by 9%/5%).
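The two segmentation metrics can be sketched under their standard definitions (a sketch only; the papers' exact matching protocols may differ):

```python
import numpy as np

def mean_over_frames(pred: np.ndarray, gold: np.ndarray) -> float:
    """MoF: fraction of frames whose predicted event label matches the gold label."""
    return float((pred == gold).mean())

def segment_iou(pred: np.ndarray, gold: np.ndarray) -> float:
    """Mean per-class IoU over frame-label sets (one common IoU variant)."""
    ious = []
    for c in np.unique(gold):
        inter = np.logical_and(pred == c, gold == c).sum()
        union = np.logical_or(pred == c, gold == c).sum()
        ious.append(inter / union if union else 0.0)
    return float(np.mean(ious))

# Toy 7-frame sequence with three event classes
gold = np.array([0, 0, 1, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 2, 2])
print(round(mean_over_frames(pred, gold), 3))  # 0.714
print(round(segment_iou(pred, gold), 3))       # 0.556
```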
5. Technical Schemas and Scene Representation
A recurring feature of recent SADs is the explicit representation of physical or social context using structured schemas:
Scene Graph Language (SGL) (Khan et al., 2024):
- Object tokens: `obj–label–instanceID:[attr₁, attr₂, …]`
- Relationship tokens: `rel–relID:(subjectLabel–subjectID, predicate, objectLabel–objectID)`
This structured encoding pairs hierarchical object properties and relations with each scenario, supporting LLM fine-tuning for stepwise, context-grounded instruction.
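A toy serialization sketch of these token formats (the helper names are hypothetical, not part of the SGL specification):

```python
def object_token(label: str, instance_id: int, attrs: list[str]) -> str:
    """Render an object token: obj–label–instanceID:[attr₁, attr₂, …]."""
    return f"obj–{label}–{instance_id}:[{', '.join(attrs)}]"

def relationship_token(rel_id: int, subj: str, subj_id: int,
                       predicate: str, obj: str, obj_id: int) -> str:
    """Render a relationship token: rel–relID:(subjectLabel–subjectID, predicate, objectLabel–objectID)."""
    return f"rel–{rel_id}:({subj}–{subj_id}, {predicate}, {obj}–{obj_id})"

# Toy scene: a wall-mounted fire extinguisher
print(object_token("extinguisher", 3, ["red", "wall-mounted"]))
# obj–extinguisher–3:[red, wall-mounted]
print(relationship_token(7, "extinguisher", 3, "attached to", "wall", 1))
# rel–7:(extinguisher–3, attached to, wall–1)
```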
Segmentation-aware multimodality (Jewel et al., 9 Dec 2025):
- Visual streams: global and ROI-specific embeddings (CLIP-ViT), segmentation masks (SAM)
- Textual streams: controlled vocab/language normalization, BPE tokenization
- Fused representations via context-aware cross-attention and LoRA for resource efficiency.
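A minimal PyTorch sketch of the cross-attention fusion step above, with text tokens attending as queries over visual (global + ROI) embeddings; the dimensions, normalization, and LoRA placement here are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens attend over visual embeddings; residual + layer norm."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Queries: text tokens; keys/values: global + ROI visual embeddings
        fused, _ = self.attn(query=text, key=visual, value=visual)
        return self.norm(text + fused)  # residual connection

# Toy shapes: 1 caption of 16 tokens; 1 global + 4 ROI embeddings; dim 512
fusion = CrossAttentionFusion()
out = fusion(torch.randn(1, 16, 512), torch.randn(1, 5, 512))
print(out.shape)  # torch.Size([1, 16, 512])
```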
Space-domain event annotation (Xie et al., 2022):
- Typed slots (SATELLITE_NAME, LAUNCH_VEHICLE, etc.) mapped to span labels in JSON Lines format; event extraction via ODINSON dependency rules and BERT-based BIO slot models.
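A hedged example of such a record and its conversion to BIO tags for a slot model; the field names and example sentence are illustrative, not taken from the released corpus:

```python
import json

# One hypothetical JSON Lines record with token-span slot annotations
record = json.loads(
    '{"tokens": ["Soyuz", "launched", "Sentinel-1C", "from", "Kourou"],'
    ' "slots": [{"start": 0, "end": 1, "type": "LAUNCH_VEHICLE"},'
    '           {"start": 2, "end": 3, "type": "SATELLITE_NAME"}]}'
)

def to_bio(tokens: list[str], slots: list[dict]) -> list[str]:
    """Convert token-span slot annotations to BIO tags for sequence labeling."""
    tags = ["O"] * len(tokens)
    for slot in slots:
        tags[slot["start"]] = f'B-{slot["type"]}'
        for i in range(slot["start"] + 1, slot["end"]):
            tags[i] = f'I-{slot["type"]}'
    return tags

print(to_bio(record["tokens"], record["slots"]))
# ['B-LAUNCH_VEHICLE', 'O', 'B-SATELLITE_NAME', 'O', 'O']
```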
Temporal/graph features in action-centric SADs (Chang et al., 3 Oct 2025):
- Geometric, kinematic, disparity, and graph features fed to a multi-head transformer for temporal segmentation and continuous SA labeling.
6. Limitations, Open Issues, and Future Directions
SADs exhibit acknowledged limitations tied to their construction and scope:
- Restricted coverage: Current datasets often target static, indoor, or simulation environments; highly dynamic, outdoor, or temporally evolving contexts are underexplored (Khan et al., 2024).
- Annotation bottlenecks: Human verification is labor-intensive and subjective; it sometimes lacks formal agreement statistics and remains susceptible to bias.
- Schema dependence: Reliance on upstream annotation quality (e.g., 3DSSG) means missing or erroneous attributes propagate, reducing contextual reliability.
- Modality and scenario expansion: Extensions to domains such as urban disasters, complex human-autonomy teaming, and real-time interactive feedback loops are proposed but not yet realized.
- Overlap with model scaling: Out-of-context reasoning and emergent situational awareness may accelerate with larger models, complicating the interpretation and control of SA capacities (Berglund et al., 2023).
- Control/safety implications: Highly situationally aware models could exploit their knowledge to evade evaluation or induce unwarranted trust (Laine et al., 2024).
Planned enhancements include scaling SADs to broader environments (outdoors, warehouses), temporal adaptation, model-assisted annotation pipelines, and integrated diagnostic feedback from deployment logs (Khan et al., 2024, Jewel et al., 9 Dec 2025).
7. Broader Significance and Research Applications
The SAD research ecosystem supports:
- Benchmarking and diagnosis: Providing the first comprehensive diagnostic benchmarks targeting AI self-knowledge, context inference, and identity-dependent action, separate from general QA or reasoning.
- Model improvement: Identifying targeted deficits that can be addressed via additional instruction tuning, system prompts, or architecture modifications.
- Deployment safety and oversight: Enabling rigorous auditing for scenarios where model misalignment or strategic behavior (e.g., evaluation evasion) could have real-world consequences.
- Transfer and domain adaptation: SADs designed for one domain (e.g., vision-language mining emergencies) supply annotation pipelines, technical schemas, and evaluation metrics broadly portable to military, industrial, and medical applications.
- Multimodal and embodied intelligence: Structured, context-rich data enable training and evaluation of agents operating at the intersection of perception, reasoning, and action, paving the way for robust, environment-adaptive assistants.
The ongoing evolution of SADs exemplifies an increasing theoretical and practical emphasis on quantifying, understanding, and controlling the situational awareness of advanced AI systems (Laine et al., 2024, Berglund et al., 2023, Khan et al., 2024, Jewel et al., 9 Dec 2025, Chang et al., 3 Oct 2025, Xie et al., 2022, Afshar et al., 2019).