From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

Published 3 Feb 2026 in cs.HC, cs.CL, cs.ET, and cs.IR | (2602.03059v1)

Abstract: We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

Summary

The paper introduces Speech-to-Spatial, an end-to-end system that transforms ambiguous verbal instructions into spatially-grounded AR guidance, enhancing task efficiency.
It employs a multi-stage pipeline with LLM-based parsing and an object-centric relational graph to resolve diverse linguistic reference patterns.
Empirical results demonstrate significant reductions in task completion times and cognitive workload, with summary overlays outperforming conventional audio-only guidance.

Speech-to-Spatial: From Verbal Instructions to Spatially-Grounded AR Guidance

Introduction and Motivation

Speech-driven remote assistance is widely adopted in scenarios where direct visual or embodied collaboration is infeasible. However, spoken instructions are often ambiguous because their referential expressions are under-specified. Traditional systems mitigate this ambiguity through manual visual annotations, gestures, or gaze cues, strategies that introduce operational overhead or demand specialized hardware. "From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality" (2602.03059) proposes Speech-to-Spatial, an end-to-end framework that converts verbal remote-assistance instructions into spatially grounded AR guidance, aiming to resolve referent ambiguity from speech alone.

The approach is motivated by a formative study revealing four recurring linguistic reference patterns in remote instructions: Direct Attribute, Relational, Remembrance (memory-based), and Chained referencing. Speech-to-Spatial operationalizes these patterns by parsing utterances, constructing an object-centric relational graph, and generating persistent AR visual indicators that clarify the intended referent and associated actions in real time.

Framework Architecture and Technical Pipeline

The system architecture centers on a multi-stage pipeline (Figure 3): speech transcription, linguistic attribute extraction, referential graph construction, semantic reasoning, and AR visualization. Speech input is parsed by LLMs into a structured format, segmenting target objects, anchors, features, relational language, and temporal cues (Figure 1). The framework builds a relational graph mapping objects, spatial relationships, and interaction history, encoding multi-dimensional attributes (space, time, intent, and action) per object (Figure 2).

Figure 3: End-to-end pipeline: Speech input is parsed, attributes are extracted, a relational graph built, and AR indicators rendered for referent grounding.

Figure 1: Attribute parsing: Verbal instructions are structured for downstream reasoning via LLM extraction.
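To make the parsed format concrete, here is a minimal sketch of what one structured record could look like. The field names (`target`, `features`, `relation`, `anchor`, `temporal_cue`) mirror the categories listed above, but the class, the `cue_types` helper, and the label strings are illustrative assumptions rather than the paper's schema, and the LLM extraction step itself is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedReference:
    """Hypothetical structured form of one spoken instruction."""
    target: str                                         # e.g. "cube"
    features: List[str] = field(default_factory=list)   # e.g. ["red"]
    relation: Optional[str] = None                      # e.g. "left of"
    anchor: Optional[str] = None                        # e.g. "blue cube"
    temporal_cue: Optional[str] = None                  # e.g. "you just moved"


def cue_types(ref: ParsedReference) -> List[str]:
    """List which kinds of cue a parsed reference carries (assumed labels)."""
    cues = []
    if ref.features:
        cues.append("direct_attribute")
    if ref.relation and ref.anchor:
        cues.append("relational")
    if ref.temporal_cue:
        cues.append("remembrance")
    return cues


# "the red cube to the left of the blue cube"
example = ParsedReference(
    target="cube", features=["red"], relation="left of", anchor="blue cube"
)
print(cue_types(example))  # ['direct_attribute', 'relational']
```

An utterance that carries more than one cue type, like the example, is plausibly what the paper calls a Chained reference, though that mapping is our reading and not something the summary states.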

Figure 2: Object-centric relational graph: Each object maintains a node linked with intra- and inter-object attributes supporting referential disambiguation.
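One way to picture the object-centric graph is as a set of nodes that each hold their own attributes and interaction history, with spatial relations checked between pairs of nodes. The sketch below is a simplified stand-in under several assumptions: the class and method names are hypothetical, relations are computed on demand from x-coordinates instead of being stored as edges, and `+x` is taken to be the viewer's right.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    obj_id: str
    label: str
    color: str
    position: Tuple[float, float, float]   # shared-space (x, y, z), assumed
    history: List[Tuple[float, str]] = field(default_factory=list)  # (time, action)


class RelationalGraph:
    """Hypothetical object-centric graph; not the paper's implementation."""

    def __init__(self) -> None:
        self.nodes: Dict[str, ObjectNode] = {}

    def add(self, node: ObjectNode) -> None:
        self.nodes[node.obj_id] = node

    def record(self, obj_id: str, timestamp: float, action: str) -> None:
        self.nodes[obj_id].history.append((timestamp, action))

    @staticmethod
    def holds(relation: str, a: ObjectNode, b: ObjectNode) -> bool:
        # Inter-object relation derived from positions (assumes +x = right).
        if relation == "left_of":
            return a.position[0] < b.position[0]
        if relation == "right_of":
            return a.position[0] > b.position[0]
        raise ValueError(f"unsupported relation: {relation}")

    def resolve(self, label: str, color: Optional[str] = None,
                relation: Optional[str] = None,
                anchor_id: Optional[str] = None) -> List[str]:
        """Return ids of objects matching attribute and relational cues."""
        matches = []
        for node in self.nodes.values():
            if node.label != label:
                continue
            if color is not None and node.color != color:
                continue
            if relation is not None and anchor_id is not None:
                anchor = self.nodes[anchor_id]
                if node.obj_id == anchor_id or not self.holds(relation, node, anchor):
                    continue
            matches.append(node.obj_id)
        return matches

    def last_with_action(self, action: str) -> Optional[str]:
        """Memory-based lookup: the object most recently given this action."""
        best, best_time = None, float("-inf")
        for node in self.nodes.values():
            for ts, act in node.history:
                if act == action and ts > best_time:
                    best, best_time = node.obj_id, ts
        return best


graph = RelationalGraph()
graph.add(ObjectNode("red_a", "cube", "red", (-0.2, 0.0, 1.0)))
graph.add(ObjectNode("red_b", "cube", "red", (0.3, 0.0, 1.0)))
graph.add(ObjectNode("blue", "cube", "blue", (0.0, 0.0, 1.0)))
graph.record("red_a", 3.0, "moved")
graph.record("red_b", 12.5, "moved")

print(graph.resolve("cube", color="red", relation="left_of", anchor_id="blue"))  # ['red_a']
print(graph.last_with_action("moved"))  # red_b
```

Keeping the history on each node is what lets a phrase such as "the one you just moved" be answered from the same structure that answers "the red cube left of the blue cube".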

The reasoning backend, leveraging semantic similarity and LLM-based compositional inference, handles variable and chained expression resolution. Viewpoint-aware candidate filtering and occlusion culling optimize candidate selection, while prior interaction traces are retained per object to support memory-based references.
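As a rough illustration of viewpoint-aware candidate filtering, the sketch below keeps only objects that fall inside a cone around the camera's forward direction. The function names, the 60-degree field of view, and the cone approximation are all assumptions made for this example; occlusion culling and the semantic-similarity scoring are left out.

```python
import math
from typing import Dict, List, Sequence


def in_view(camera_pos: Sequence[float], camera_forward: Sequence[float],
            obj_pos: Sequence[float], fov_deg: float = 60.0) -> bool:
    """True if obj_pos lies within fov_deg / 2 of the forward direction."""
    to_obj = [o - c for o, c in zip(obj_pos, camera_pos)]
    dist = math.sqrt(sum(v * v for v in to_obj))
    if dist == 0.0:
        return True  # object at the camera position; treat as visible
    fwd_norm = math.sqrt(sum(v * v for v in camera_forward))
    cos_angle = sum(t * f for t, f in zip(to_obj, camera_forward)) / (dist * fwd_norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= fov_deg / 2.0


def filter_candidates(camera_pos: Sequence[float], camera_forward: Sequence[float],
                      objects: Dict[str, Sequence[float]],
                      fov_deg: float = 60.0) -> List[str]:
    """Drop objects outside the assumed viewing cone before reasoning."""
    return [oid for oid, pos in objects.items()
            if in_view(camera_pos, camera_forward, pos, fov_deg)]


objects = {
    "a": (0.0, 0.0, 2.0),    # straight ahead
    "b": (2.0, 0.0, 0.0),    # 90 degrees to the side
    "c": (0.0, 0.0, -1.0),   # behind the camera
    "d": (0.5, 0.0, 2.0),    # about 14 degrees off-axis
}
print(filter_candidates((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), objects))  # ['a', 'd']
```

Pruning candidates this way shrinks the set that later similarity or LLM-based inference has to consider, which is the role the paragraph above attributes to the filtering step.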

Visual indicators (e.g., directional arrows, concise instructional overlays) are anchored in AR directly above resolved objects, providing actionable guidance that persists until task completion. Summarization modules distill lengthy utterances into concise directives, minimizing cognitive overhead.
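A persistent indicator anchored above an object can be sketched as a small record placed at the top of the object's bounding box and hidden once the task is done. Everything here is hypothetical: the `Indicator` class, the y-up convention, and the `offset` parameter are assumptions for illustration, and the summary text is supplied by hand where the real system would generate it.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Indicator:
    obj_id: str
    text: str          # concise directive shown to the user
    position: Vec3     # world-space anchor point for the overlay
    visible: bool = True


def place_indicator(obj_id: str, bbox_min: Vec3, bbox_max: Vec3,
                    text: str, offset: float = 0.25) -> Indicator:
    """Anchor an overlay above the centre of the bounding box (assumes y is up)."""
    cx = (bbox_min[0] + bbox_max[0]) / 2.0
    cz = (bbox_min[2] + bbox_max[2]) / 2.0
    top_y = bbox_max[1] + offset
    return Indicator(obj_id, text, (cx, top_y, cz))


def complete_task(indicator: Indicator) -> None:
    """The overlay persists until the task is marked complete."""
    indicator.visible = False


ind = place_indicator("red_a", (0.0, 0.0, 0.0), (1.0, 0.5, 1.0),
                      "Move left of the blue cube")
print(ind.position, ind.visible)
complete_task(ind)
print(ind.visible)
```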

Empirical Evaluation: Task Performance and Usability

A controlled user study with 18 participants compared the Speech-to-Spatial system—with both full and summarized transcription overlays—to a conventional audio-only baseline. Participants performed locate and move tasks requiring disambiguation of target cubes among distractors under different guidance modalities.

Key quantitative findings include:

Task completion time: For locate tasks, median times (Summary: $3.25$ s, Full: $4.08$ s, Audio: $4.33$ s) showed a statistically significant reduction for summary guidance ($p < .001$). For move tasks (Summary: $6.33$ s, Full: $7.94$ s, Audio: $9.31$ s), summary guidance was again faster than both the full and audio conditions.
Accuracy: Move tasks showed significant gains for summary guidance (Summary: $73.1\%$, Audio: $64.4\%$; $p < .020$).
Perceived difficulty and confidence: Summary and full conditions reduced cognitive workload and increased confidence, with summary guidance yielding significantly higher confidence ratings.
Effect of reference patterns: The largest efficiency gains arise with relational, memory-based, and chained reference types, the patterns that tend to be most ambiguous in speech-only setups.

Figure 4: Median task completion times per referencing pattern demonstrate significant reductions for Speech-to-Spatial guidance, especially for relational and memory-based utterances.
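To put the reported medians in relative terms, the short calculation below computes the percentage reduction in median completion time for summary guidance against each of the other conditions. These percentages are derived here from the medians quoted above; they are not figures reported by the paper, and a ratio of medians says nothing about statistical significance.

```python
# Median completion times in seconds, as quoted in the findings above.
medians = {
    "locate": {"summary": 3.25, "full": 4.08, "audio": 4.33},
    "move": {"summary": 6.33, "full": 7.94, "audio": 9.31},
}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


reductions = {}
for task, t in medians.items():
    vs_audio = relative_reduction(t["audio"], t["summary"])
    vs_full = relative_reduction(t["full"], t["summary"])
    reductions[task] = (vs_audio, vs_full)
    print(f"{task}: {vs_audio:.1f}% shorter than audio, "
          f"{vs_full:.1f}% shorter than full")
```

With these inputs, summary guidance comes out roughly 24.9% shorter than audio-only for locate tasks and 32.0% shorter for move tasks, and about 20.3% shorter than the full-transcription overlay in both.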

Qualitative feedback underscored the preference for concise visual directives, with 79% of participants rating summary mode highly, citing reduced memory burden and streamlined actionability.

Use Cases and Applications

Speech-to-Spatial is demonstrated in three real-world collaborative AR scenarios:

Remote maintenance: Verbal identification of components is disambiguated and visually annotated, reducing manual effort and iterative micro-guidance.
Indoor navigation: Spoken instructions mapping routes and landmarks are spatially visualized, eliminating dependence on users' mental models of the environment.
Personal AI assistant: Ambiguous verbal queries to AI agents are visually anchored, mitigating misinterpretation and enhancing transparency.

Figure 5: Three use case scenarios: (A) Remote Maintenance; (B) Indoor Navigation; (C) Personal AI assistant.

Discussion, Limitations, and Future Directions

Speech-to-Spatial demonstrates robust improvements in instruction clarity, efficiency, and user satisfaction under speech-only constraints, but several technical limitations remain:

Language coverage: Current graph representation primarily supports object-centered referencing; broader integration of view-centered and environment-centered frames requires advanced multimodal fusion and geometric reasoning.
Reasoning complexity: Chained and ordinal references ("second to the right of" or nested constructs) expose limitations in multi-hop and global relationship inference.
Visual guidance design: The impact of overlay modality (2D/3D, icons/arrows) and display fidelity on usability warrants further investigation.
Real-world robustness: Assumptions of spatial coordinate synchronization and reliable object localization must be relaxed for scalable deployment in unconstrained environments.
User agency: Balance between concise summarization and semantic completeness needs refinement to accommodate user preferences and task complexity.
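The ordinal case mentioned under reasoning complexity ("second to the right of") shows why a pairwise relation alone is not enough: the resolver has to rank every object on one side of the anchor. A minimal sketch follows, assuming a hypothetical `nth_to_the_right` helper, 3D positions keyed by object id, and `+x` as the viewer's right; it does not handle nested constructs.

```python
from typing import Dict, Optional, Sequence


def nth_to_the_right(objects: Dict[str, Sequence[float]],
                     anchor_id: str, n: int) -> Optional[str]:
    """Return the n-th object to the right of the anchor (1-based), or None."""
    anchor_x = objects[anchor_id][0]
    # Collect everything with a larger x than the anchor, nearest first.
    right_side = sorted(
        (oid for oid, pos in objects.items() if pos[0] > anchor_x),
        key=lambda oid: objects[oid][0],
    )
    if 1 <= n <= len(right_side):
        return right_side[n - 1]
    return None


cubes = {
    "blue": (0.0, 0.0, 1.0),
    "green": (0.6, 0.0, 1.0),
    "red": (0.3, 0.0, 1.0),
    "yellow": (-0.4, 0.0, 1.0),
}
print(nth_to_the_right(cubes, "blue", 2))  # green
```

Even this simple version needs a global ordering over all candidates, which hints at why multi-hop and ordinal references are harder than single relational lookups.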

Future work should extend system capabilities to support multi-turn interactions, adaptive modality switching, and deeper graph-based spatial reasoning, as well as more comprehensive user studies in authentic remote assistance workflows.

Conclusion

Speech-to-Spatial introduces a speech-driven referent disambiguation framework operationalizing linguistic reference patterns via an object-centric relational graph for grounding utterances in AR environments. Its empirical validation demonstrates measurable gains in task efficiency, comprehension, and cognitive load reduction over conventional verbal guidance. The system provides a practical bridge from ambiguous speech to visually explainable, actionable AR assistance, with potential implications for next-generation multimodal AI agents in collaborative and assistive contexts.

Figure 3: End-to-end pipeline of Speech-to-Spatial translating speech instructions into persistent AR indicators for spatial grounding.
