Conversational Orientation Reasoning: Egocentric-to-Allocentric Navigation with Multimodal Chain-of-Thought

Published 20 Sep 2025 in cs.LG, cs.AI, cs.CL, and cs.RO | (2509.18200v1)

Abstract: Conversational agents must translate egocentric utterances (e.g., "on my right") into allocentric orientations (N/E/S/W). This challenge is particularly critical in indoor or complex facilities where GPS signals are weak and detailed maps are unavailable. While chain-of-thought (CoT) prompting has advanced reasoning in language and vision tasks, its application to multimodal spatial orientation remains underexplored. We introduce Conversational Orientation Reasoning (COR), a new benchmark designed for Traditional Chinese conversational navigation projected from real-world environments, addressing egocentric-to-allocentric reasoning in non-English and ASR-transcribed scenarios. We propose a multimodal chain-of-thought (MCoT) framework, which integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. A curriculum learning strategy progressively builds these capabilities on Taiwan-LLM-13B-v2.0-Chat, a mid-sized model representative of resource-constrained settings. Experiments show that MCoT achieves 100% orientation accuracy on clean transcripts and 98.1% with ASR transcripts, substantially outperforming unimodal and non-structured baselines. Moreover, MCoT demonstrates robustness under noisy conversational conditions, including ASR recognition errors and multilingual code-switching. The model also maintains high accuracy in cross-domain evaluation and resilience to linguistic variation, domain shift, and referential ambiguity. These findings highlight the potential of structured MCoT spatial reasoning as a path toward interpretable and resource-efficient embodied navigation.


Summary

The paper introduces a multimodal chain-of-thought (MCoT) framework that transforms egocentric spatial descriptions into allocentric directions.
The method integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process, reaching 100% orientation accuracy on clean transcripts and 98.1% on ASR transcripts.
Ablation studies confirm that both the spatial coordinates and the structured CoT reasoning are critical for mitigating ASR errors and improving orientation accuracy.

Introduction

The paper "Conversational Orientation Reasoning: Egocentric-to-Allocentric Navigation with Multimodal Chain-of-Thought" introduces a benchmark, Conversational Orientation Reasoning (COR), and a reasoning framework for conversational navigation agents. The problem addressed is the conversion of egocentric spatial descriptions, such as "on my right," into allocentric orientations (North, East, South, or West), particularly in indoor or complex environments where GPS signals are weak and detailed maps are unavailable. The proposed multimodal chain-of-thought (MCoT) framework combines speech transcribed by automatic speech recognition (ASR) with landmark coordinates and reasons over both to infer the user's orientation.

Figure 1: Pipeline of our MCoT framework. It consists of three modules: (1) speech synthesis and transcription, (2) multimodal input preparation and fusion, and (3) orientation reasoning.

Methodology

The MCoT framework comprises three modules: speech synthesis and transcription, multimodal input preparation and fusion, and orientation reasoning. Speech is synthesized from clean egocentric descriptions and transcribed with ASR; the transcripts are then combined with landmark coordinates to infer absolute directions. Orientation reasoning proceeds in three steps: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. The framework is implemented on Taiwan-LLM-13B-v2.0-Chat, a mid-sized LLM, using a curriculum learning strategy that builds these capabilities progressively for Traditional Chinese input under noisy ASR conditions.
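The three reasoning steps can be illustrated with a small rule-based sketch. This is not the paper's implementation (the paper uses a fine-tuned LLM); it only shows the geometry the steps have to get right. The coordinate convention (x increases eastward, y increases northward), the English keyword table, and all function names are assumptions made for illustration.

```python
from typing import Tuple

# Compass directions in clockwise order.
DIRECTIONS = ["N", "E", "S", "W"]

# Clockwise quarter-turns from the user's facing direction to the landmark.
RELATION_TURNS = {"front": 0, "right": 1, "behind": 2, "left": 3}

# Hypothetical English phrases; the benchmark itself is in Traditional Chinese.
RELATION_KEYWORDS = {
    "in front of me": "front",
    "on my right": "right",
    "behind me": "behind",
    "on my left": "left",
}

Point = Tuple[int, int]


def extract_relation(utterance: str) -> str:
    """Step 1: extract the egocentric spatial relation from the utterance."""
    text = utterance.lower()
    for phrase, relation in RELATION_KEYWORDS.items():
        if phrase in text:
            return relation
    raise ValueError(f"no spatial relation found in: {utterance!r}")


def absolute_direction(user: Point, landmark: Point) -> str:
    """Step 2: map coordinates to the absolute direction of the landmark.

    Assumes x grows toward East and y grows toward North, and snaps to the
    dominant axis when the landmark is not exactly on a grid line.
    """
    dx = landmark[0] - user[0]
    dy = landmark[1] - user[1]
    if dx == 0 and dy == 0:
        raise ValueError("user and landmark share the same cell")
    if abs(dx) >= abs(dy):
        return "E" if dx > 0 else "W"
    return "N" if dy > 0 else "S"


def infer_orientation(utterance: str, user: Point, landmark: Point) -> str:
    """Step 3: infer which way the user is facing."""
    relation = extract_relation(utterance)
    landmark_idx = DIRECTIONS.index(absolute_direction(user, landmark))
    # Undo the egocentric offset to recover the facing direction.
    facing_idx = (landmark_idx - RELATION_TURNS[relation]) % 4
    return DIRECTIONS[facing_idx]


if __name__ == "__main__":
    # Landmark lies to the East and is on the user's right, so the user faces North.
    print(infer_orientation("The bookstore is on my right", (5, 5), (8, 5)))
```

A keyword lookup like this breaks as soon as the transcript contains ASR errors or ambiguous references, which is the gap the structured MCoT reasoning is meant to close.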

Figure 2: Task environment. Gongguan MRT area projected into a 10 × 10 grid map for testing.

Empirical Evaluation

With structured reasoning, MCoT achieves 100% orientation accuracy on clean transcripts and 98.1% on ASR transcripts, substantially outperforming unimodal and non-structured baselines. The curriculum learning strategy builds the orientation reasoning capabilities progressively, and the resulting model is robust to linguistic variation, domain shift, and referential ambiguity.

Figure 3: Comparison of standard prompting and MCoT. Standard prompting fails under ambiguous egocentric descriptions, whereas MCoT uses structured steps for better accuracy and interpretability.

Ablation Studies and Robustness

Ablation studies reveal the importance of spatial coordinates and structured CoT reasoning, each contributing to improved accuracy and reduced format errors when ASR noise is introduced. Moreover, the model generalizes well across linguistic variations and new spatial domains while maintaining high accuracy in referentially ambiguous contexts.
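The two quantities discussed above, orientation accuracy and format errors, can be computed as in the following minimal sketch. The answer format (a final "Orientation: X" marker with X in N/E/S/W) and the function names are assumptions for illustration; the summary does not specify how the paper parses model outputs.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical answer format: "Orientation: N" (or E, S, W).
ANSWER_PATTERN = re.compile(r"Orientation:\s*([NESW])\b")


def parse_orientation(output: str) -> Optional[str]:
    """Return the last well-formed answer in the output, or None if absent."""
    matches = ANSWER_PATTERN.findall(output)
    return matches[-1] if matches else None


def score(outputs: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (orientation accuracy, format-error rate) over all examples.

    An output with no parseable answer counts as a format error and as
    incorrect for the purposes of accuracy.
    """
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equal length")
    correct = 0
    format_errors = 0
    for output, label in zip(outputs, gold):
        prediction = parse_orientation(output)
        if prediction is None:
            format_errors += 1
        elif prediction == label:
            correct += 1
    total = len(gold)
    return correct / total, format_errors / total
```

Counting unparseable outputs separately from wrong answers makes it possible to tell whether an ablated variant fails at reasoning or merely at following the output format.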

Figure 4: ASR error severity distribution in the two evaluation sets.

Figure 5: Cross-domain evaluation environment. Taipei Station area projected into a 10 × 10 grid.

Error Analysis

The residual errors stem primarily from direction understanding and ASR misrecognition. Representative cases show the model misapplying spatial mapping rules or failing to extract the correct spatial relation from noisy ASR input.

Figure 6: Representative error cases falling into three categories: direction understanding errors, relation extraction errors, and ASR misrecognition errors.

Conclusion

The paper shows that structured multimodal reasoning yields high orientation accuracy and robustness across clean, ASR-transcribed, and cross-domain settings. Its main limitations are the grid-based environment and the reliance on synthesized speech data. Future work may explore larger continuous spaces, multilingual contexts, and the integration of additional sensory data, such as visual cues, in embodied navigation systems.
