
Ego-ST Bench

Updated 2 February 2026
  • Ego-ST Bench is a benchmark suite that rigorously evaluates task-oriented spatio-temporal reasoning and grounding from a first-person perspective.
  • It measures explicit versus implicit grounding and single versus multi-object localization using over 5,000 question-answer pairs across 789 video segments.
  • The benchmark advances embodied AI by bridging perception to action for navigation, manipulation, and AR/VR applications in complex, real-world environments.

Ego-ST Bench is a benchmark suite introduced to evaluate and advance task-oriented spatio-temporal grounding and reasoning in egocentric (first-person) videos. It shifts the focus from simple object recognition and descriptive questions to complex goal-driven, spatial, temporal, and integrative spatio-temporal reasoning, reflecting the challenges faced by embodied agents and AR/VR systems in real-world environments. By formalizing and rigorously measuring embodied intelligence along multiple axes—explicit/implicit grounding, single/multi-object, and high-level route and action understanding—Ego-ST Bench provides a critical testbed and associated methodologies for contemporary multimodal LLMs.

1. Motivation and Scope

The primary motivation for Ego-ST Bench is to assess and drive progress on embodied AI agents’ ability to perform advanced spatio-temporal reasoning and grounding from an egocentric perspective. Tasks encountered by robotic agents, AR assistants, and other embodied systems frequently require understanding both spatial (“where”) and temporal (“when”) aspects of objects and events from a first-person coordinate system, not just static recognition or object-centric Q&A.

Legacy benchmarks are limited in several dimensions:

  • Descriptive focus: Most prior datasets ask only for recognition or simple localization, e.g., "Which cup is red?" rather than task-driven queries like "I'm thirsty—where can I turn on water?".
  • Object-centric bias: Traditional spatial relations are defined from the viewpoint of scene objects or in 2D, failing to capture 3D egocentric complexity.
  • Temporal shallowness: Temporally extended understanding—such as reasoning about the sequence of actions, direction changes, or integrated routes—is poorly covered.

Ego-ST Bench directly addresses these gaps by evaluating:

  • Task-oriented reasoning and bridging perception to action.
  • Explicit-Implicit duality: inferring both directly mentioned and contextually implied objects.
  • Multi-object dependencies: grounding and tracking multiple interrelated objects.
  • Spatio-temporal integration: combining 3D spatial and temporal chains for real-world embodied navigation and manipulation tasks (Wu et al., 16 Mar 2025, Xu et al., 3 Dec 2025).

2. Benchmark Structure and Task Taxonomy

Ego-ST Bench comprises four principal task types:

  • Route Description (Integrated 4D reasoning): Open-ended narratives requiring chaining landmarks, direction changes, and actions; evaluated in both forward and reverse (retracing) directions.
  • Direction Change Description: Identification of self-motion events, e.g., left/right/U-turn transitions.
  • Landmark Description: Spatial localization of static elements relative to the agent's trajectory.
  • Action Description: Temporal prediction and recognition of discrete action segments.

Each task type is evaluated using both multiple-choice and open-ended Q&A, with 5,000+ expert-curated QA pairs across 789 egocentric video segments from sources including SUN3D, HUJI, DoMSEV, and Aria Everyday. Reasoning is systematically tested in both forward and reverse situations, capturing challenges in both following a route and retracing steps.
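A single benchmark item from the taxonomy above can be pictured as a small record. This is an illustrative sketch only; the field names and example values are assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical layout for one Ego-ST Bench item; field names and the
# example values below are illustrative, not the released schema.
@dataclass
class EgoSTItem:
    video_id: str        # one of the 789 egocentric segments
    task_type: str       # "route" | "direction" | "landmark" | "action"
    direction: str       # "forward" or "reverse" (retracing)
    question: str
    choices: list = field(default_factory=list)  # empty for open-ended items
    answer: str = ""

item = EgoSTItem(
    video_id="sun3d_0042",  # hypothetical identifier
    task_type="direction",
    direction="reverse",
    question="After passing the doorway, which way did the wearer turn?",
    choices=["left", "right", "U-turn"],
    answer="right",
)
is_multiple_choice = bool(item.choices)
```

Open-ended items would simply leave `choices` empty and route the answer to LLM-judge grading instead of exact matching.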

A companion benchmark, ToG-Bench (Xu et al., 3 Dec 2025), formalizes egocentric, task-based video grounding along three axes:

  • Task-Oriented Grounding: Localizing tools and objects required to fulfill intent-driven instructions (“where do I turn on water?” rather than “which faucet is silver?”).
  • Explicit-Implicit Duality: Answering queries using either explicitly mentioned or contextually inferred objects.
  • One-to-Many Grounding: Supporting grounding of multiple necessary objects per instruction (e.g., "Set up for a presentation" requiring {laptop, remote, projector,...}).

The dataset comprises 100 annotated ScanNet-based egocentric videos, 2,704 task-oriented instructions, and 4,194 objects across 177 functional categories. Approximately half of the instructions are implicit, and tasks with more than one target object are as frequent as single-object cases.
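The explicit-implicit and one-to-many axes can be captured in a compact record. A minimal sketch, with field names introduced here for illustration rather than taken from the dataset release:

```python
from dataclasses import dataclass

# Illustrative ToG-Bench-style record (field names are assumptions): one
# task instruction grounded to one or many target objects, marked explicit
# (object named in the instruction) or implicit (object must be inferred).
@dataclass(frozen=True)
class ToGInstruction:
    instruction: str
    targets: tuple   # object categories to ground in the video
    explicit: bool

examples = [
    ToGInstruction("Turn on the water", ("faucet",), explicit=False),
    ToGInstruction("Set up for a presentation",
                   ("laptop", "remote", "projector"), explicit=False),
]

# One-to-many grounding: instructions whose fulfillment needs >1 object.
multi_object = [e for e in examples if len(e.targets) > 1]
```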

3. Dataset Construction, Annotation Protocols, and Data Properties

Ego-ST Bench:

  • Sourcing: Clips sampled from six established egocentric datasets and self-collected videos, spanning both indoor and outdoor environments.
  • Preprocessing: Manual filtering of clips ensures dense, labelable spatio-temporal events.
  • Annotation:
    • 150+ hours of manual annotation for route and action descriptions.
    • Multiple-choice QA generated via seeded GPT-4o prompting, followed by human linguistic and correctness validation.
    • Stratified sampling creates balanced splits across reasoning type, direction (forward/reverse), and scene complexity.
  • Representation: 5,000+ QA pairs, with open-ended tasks graded by LLM judges for direction, landmarks, and semantic coherence.

ToG-Bench (three-stage annotation pipeline):

  • Stage 1 (Instruction Generation): Frames and scene captions are ingested by a multimodal LLM (Gemini 2.5 Pro) to generate a mix of explicit and implicit instructions. Automated filters ensure feasibility and visibility of targeted tasks.
  • Stage 2 (Object Grounding and Tracking): Grounding-DINO locates each object on its first visible frame; propagation with SAM2 produces spatio-temporal tubes at 1 fps.
  • Stage 3 (Human Verification): Annotators refine tracks, correct object boundaries and temporal phases, and remove drift and hallucinations.
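The drift that Stage 3 annotators remove can be surfaced automatically by a cheap consistency check: flag 1 fps frames where a track's box barely overlaps the previous frame's box. A minimal sketch; the 0.3 threshold and `(x1, y1, x2, y2)` box format are assumptions, not the paper's protocol:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def flag_drift(track, min_iou=0.3):
    """Return 1 fps frame indices where consecutive boxes overlap
    less than min_iou -- a cheap signal of tracker drift."""
    return [i for i in range(1, len(track))
            if iou(track[i - 1], track[i]) < min_iou]

# Frame 2 jumps far from the object: a candidate drift event.
track = [(0, 0, 10, 10), (1, 0, 11, 10), (40, 40, 50, 50)]
drift_frames = flag_drift(track)  # -> [2]
```

Flagged frames would still go to a human for the final boundary correction; the check only prioritizes where to look.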

Key statistics:

| Property                  | ToG-Bench Value |
|---------------------------|-----------------|
| # Videos                  | 100             |
| # Task Instructions       | 2,704           |
| # Object Instances        | 4,194           |
| Explicit : Implicit (%)   | 51.3 : 48.7     |
| Single : Multi-object (%) | 49.7 : 50.3     |
| # Object Categories       | 177             |

4. Evaluation Metrics and Protocols

  • Multiple-choice Accuracy: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}(\hat y_i = y_i)$.
  • Route Description (open-ended): Graded for direction change coverage, landmark recall, and logical/semantic coherence by LLM judges, each on a 0–5 scale. Composite score is percentage of max 15 points.
  • Benchmark splits: Zero-shot (no finetuning) is the standard. No cross-validation; fixed splits for reproducibility.
  • Mean IoU (mIoU): Overlap between predicted and ground-truth bounding boxes.
  • Recall@τ (Temporal Grounding): Fraction of predictions above a given IoU threshold.
  • AP@τ (Spatial Grounding): Average precision at multiple IoU thresholds $\tau \in \{0.3, 0.5, 0.7\}$.
  • Recognition Accuracy (Acc): Text embedding cosine similarity for category matching.
  • Stratified reporting: Explicit vs. implicit, 1/2/3+ object tasks, spatial/temporal scoring.
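The core metrics above can be sketched directly from their definitions. The interval representation and the judge-score aggregation are assumptions consistent with the description (three 0–5 scores reported as a percentage of the 15-point maximum):

```python
def mc_accuracy(preds, golds):
    """Multiple-choice accuracy: mean of exact matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def temporal_iou(pred, gold):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(preds, golds, tau):
    """Recall@tau: fraction of predictions with temporal IoU >= tau."""
    hits = sum(temporal_iou(p, g) >= tau for p, g in zip(preds, golds))
    return hits / len(golds)

def route_score(direction, landmarks, coherence):
    """Composite open-ended route score: three 0-5 judge
    scores reported as a percentage of the 15-point maximum."""
    return 100.0 * (direction + landmarks + coherence) / 15.0

acc = mc_accuracy(["left", "right"], ["left", "U-turn"])  # 0.5
r05 = recall_at([(2.0, 6.0)], [(3.0, 7.0)], tau=0.5)      # 1.0 (IoU = 0.6)
score = route_score(4, 3, 5)                              # 80.0
```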

5. Baseline Results and Analysis

  • Pure spatial (Landmark)/temporal (Action) QA: Models achieve 60–85% accuracy.
  • Integrated (Route/Direction): Accuracies are typically under 50%, with reverse (retracing) reasoning 5–10 points lower than forward.
  • Model scale: Scaling from 7B to 72B parameters yields only marginal gains on integrated tasks.
  • Open/closed-source gap: Minimal; both types of models show similar performance on most tasks.
  • Failure modes: Chaining multiple actions and landmarks; robustness in reverse direction reasoning.
  • Explicit vs. Implicit: 15–25 point drop in task accuracy for proprietary models when the target object must be inferred.
  • Single vs. Multi-object: Even top models’ performance drops from ~96% in 1-object tasks to 75% in 3+-object scenarios; open-source models collapse below 30% for 3+ object tasks.
  • Spatial Grounding (m_vIoU): Lags behind recognition accuracy by 10–20 points, highlighting perceptual challenges.
  • Long-horizon consistency: Temporal and spatial localization accuracy degrades by 30–50% on videos exceeding 150 seconds.

Selected sample performance figures:

| Model          | T-Acc (all) | T-EAcc (explicit) | T-IAcc (implicit) | 1-obj T-Acc | 3+-obj T-Acc |
|----------------|-------------|-------------------|-------------------|-------------|--------------|
| GPT-5          | 89.4%       | 98.2%             | 80.2%             | 96.1%       | 75.0%        |
| Gemini 2.5 Pro | 80.1%       | 91.5%             | 68.2%             | 90.8%       | 60.5%        |
| Qwen3-VL-8B    | 65.1%       | 76.0%             | 53.7%             | 83.4%       | 29.0%        |

6. Principal Findings and Challenges

Four main challenges are exposed by Ego-ST Bench evaluations:

  1. Recognition is necessary but insufficient: MLLMs can often guess a needed object but struggle to precisely localize it in egocentric views, especially under occlusion and clutter.
  2. Implicit reasoning remains brittle: Inferring unmentioned but necessary tools for a goal (e.g., deducing “faucet” for “turn on water”) exposes significant deficits in current commonsense and intent-integration.
  3. Coordinated grounding for multiple objects is underdeveloped: Robust joint localization of all required objects for a compound instruction is rarely achieved by current models.
  4. Temporal consistency degrades rapidly: Both temporal and spatial grounding performance drop substantially in long (>150s) videos, often due to motion, occlusions, and cumulative drift.

These findings suggest that future models will need enhanced joint spatio-temporal architectures, stronger intent-to-object reasoning modules, and improved commonsense knowledge for robust implicit grounding.

7. Impact and Future Directions

Ego-ST Bench’s comprehensive protocol, spanning explicit/implicit, single/multi-object, and 4D (3D+time) requirements, constitutes a rigorous testbed for developing and evaluating new embodied AI systems. Its findings demonstrate the current limitations of large multimodal LLMs in meeting the demands of real-world egocentric perception and interaction. Further, it sets the research agenda for closing gaps in:

  • Task-driven, context-sensitive grounding architectures.
  • Long-horizon temporal and multi-object consistency.
  • Explicit integration of external commonsense and task knowledge sources.

By establishing explicit, comparative performance benchmarks, Ego-ST Bench facilitates the robust and measurable advancement of goal-driven, embodied intelligence in complex visual environments (Wu et al., 16 Mar 2025, Xu et al., 3 Dec 2025).
