
LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

Published 26 Dec 2025 in cs.CV and cs.AI | (2512.22010v1)

Abstract: Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.

Summary

  • The paper demonstrates that integrating spatiotemporal history improves instruction-aligned UAV navigation.
  • The SHIC module compresses visual inputs into fixed semantic slots while STE encodes motion continuity to reduce drift.
  • Experimental results show an increase in success rate (+7.89%) and lower navigation errors in unseen, complex environments.

LongFly: A Spatiotemporal Context Modeling Framework for Long-Horizon UAV Vision-and-Language Navigation

Introduction and Problem Context

Unmanned aerial vehicles (UAVs) are integral to complex, high-stakes applications such as post-disaster search and rescue, remote sensing, and large-scale environmental monitoring. The shift to vision-and-language navigation (VLN) represents a critical stage in enabling UAVs to autonomously interpret and execute natural language instructions within authentic and dynamic three-dimensional (3D) environments. Although recent advances in UAV-VLN have demonstrated promise for short-horizon, atomic navigation tasks, end-to-end robustness and semantic alignment in complex, long-horizon scenarios remain significantly under-explored. Existing approaches inadequately model the long-horizon spatiotemporal context, frequently resulting in drift, semantic misalignment, and unreliable path planning.

The work "LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration" (2512.22010) directly addresses the core challenge of integrating fragmented historical sensory and trajectory information with ongoing perception and language instructions. Central to its thesis is the proposal of a unified, history-aware spatiotemporal framework for VLN, which compresses redundant cues into compact, structured representations to support robust, instruction-aligned, and efficient navigation in both seen and unseen environments. Figure 1

Figure 1: Integrating spatiotemporal context enables LongFly to maintain robust navigation despite viewpoint and layout changes, contrasting with prior reliance on myopic cues.

LongFly Architecture: Structured Spatiotemporal Modeling

The LongFly framework comprises three technically interdependent modules: Slot-Based Historical Image Compression (SHIC), Spatio-Temporal Trajectory Encoding (STE), and Prompt-Guided Multimodal Integration (PGM). Collectively, these modules transform highly redundant and unstructured sensory histories into semantically relevant, low-dimensional state representations expressly aligned to the navigation instruction. Figure 2

Figure 2: High-level architecture of LongFly, highlighting the integration of language, historical visual memory, and trajectory tokens for end-to-end multimodal reasoning.

Slot-Based Historical Image Compression (SHIC)

SHIC distills historical multi-view RGB observations into a fixed number of recurrent semantic slots. Visual features are projected (via CLIP ViT-L/14) and repeatedly aggregated through a soft-attentive mechanism, followed by GRU-based updates. This fixed-capacity slotting achieves temporal consistency and salient cue preservation while bounding what would otherwise be linear memory growth to O(1) overhead. Key properties include persistent retention of critical landmarks, spatial layout, and discrimination of cues across multiple camera views. Figure 3

Figure 3: SHIC module encodes variable-length, multi-view visual histories into a dynamic, fixed slot representation suitable for integration with temporal and instruction information.
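The fixed-capacity slot update can be sketched in a few lines. The sketch below is illustrative only: it uses random weights in place of learned ones, a single GRU-style update gate rather than the full GRU, and the class name `SlotCompressor` and all dimensions are assumptions, not the paper's implementation. What it demonstrates is the key property claimed above: however many historical frames arrive, the slot memory stays a fixed-size array.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SlotCompressor:
    """Toy slot-based history compressor: K fixed slots soft-attend over
    each incoming frame's patch features, then a GRU-style gate blends
    the attended readout into the slots."""

    def __init__(self, num_slots=4, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.slots = rng.normal(size=(num_slots, dim))
        # Random stand-ins for learned gate weights.
        self.Wz = rng.normal(scale=0.1, size=(dim, dim))
        self.Uz = rng.normal(scale=0.1, size=(dim, dim))

    def update(self, frame_feats):
        # frame_feats: (num_patches, dim) features of one historical view.
        attn = softmax(self.slots @ frame_feats.T / np.sqrt(self.slots.shape[1]))
        read = attn @ frame_feats                       # (num_slots, dim) readout
        z = 1.0 / (1.0 + np.exp(-(read @ self.Wz + self.slots @ self.Uz)))
        self.slots = (1.0 - z) * self.slots + z * read  # gated slot update
        return self.slots
```

Feeding in ten frames or a thousand leaves `slots` at shape `(num_slots, dim)`, which is the source of the constant-memory behavior.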

Spatio-Temporal Trajectory Encoding (STE)

STE encodes the UAV’s history of predicted waypoints as a temporally ordered, low-dimensional feature sequence. Rather than using absolute positions, relative motion vectors are decomposed into direction and step length, each augmented with time embeddings. These motion descriptors are passed through MLP layers to yield trajectory tokens encoding motion continuity and capturing long-horizon path evolution. This design explicitly tackles model instability and global drift common in large-scale navigation. Figure 4

Figure 4: STE module structures temporal history as relative-motion tokens, synthesizing both displacement and timing for robust path prior encoding.
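The relative-motion tokenization described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `encode_trajectory`, the sinusoidal time embedding, the feature dimensions, and the random-weight MLP are all illustrative choices, not the paper's architecture. The point is the decomposition itself: each step becomes a unit direction plus a scalar step length, augmented with a time code, then projected to a token.

```python
import numpy as np

def encode_trajectory(waypoints, dim=16, seed=0):
    """Toy spatio-temporal trajectory encoding: relative motion between
    consecutive waypoints is split into a unit direction and a step
    length, concatenated with a sinusoidal time embedding, and passed
    through a small random-weight MLP to yield one token per step."""
    wp = np.asarray(waypoints, dtype=float)        # (T, 3) positions
    delta = np.diff(wp, axis=0)                    # (T-1, 3) relative motion
    length = np.linalg.norm(delta, axis=1, keepdims=True)
    direction = delta / np.maximum(length, 1e-8)   # unit direction vectors
    t = np.arange(len(delta))[:, None]
    time_emb = np.concatenate([np.sin(t / 10.0), np.cos(t / 10.0)], axis=1)
    feats = np.concatenate([direction, length, time_emb], axis=1)
    rng = np.random.default_rng(seed)              # stand-in for learned MLP
    W1 = rng.normal(scale=0.1, size=(feats.shape[1], dim))
    W2 = rng.normal(scale=0.1, size=(dim, dim))
    return np.tanh(feats @ W1) @ W2                # (T-1, dim) trajectory tokens
```

Using relative rather than absolute positions makes the tokens invariant to where the episode started, which is one plausible reason this representation resists global drift.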

Prompt-Guided Multimodal Integration (PGM)

The PGM module fuses instruction, compressed historical visual memory, and encoded trajectory within a structured prompt format, compatible with LLM architectures (notably Qwen2.5-3B). Historical cues are projected into a unified embedding space – matching the multimodal token space of the backbone – and incorporated with explicit tagging to enable token-level alignment and reasoning. Prompt-based integration supersedes naive feature concatenation, significantly enhancing the model’s ability to maintain context over extended episodes and complex semantic trajectories. Figure 5

Figure 5: Prompt construction in LongFly, explicitly serializing language, motion, history, and current sensory inputs for cross-modal LLM-based reasoning.
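Structurally, the prompt serialization amounts to interleaving special placeholder tags, at which the projected slot and trajectory embeddings are spliced into the LLM's token stream, with the instruction text. The sketch below is a hypothetical template: the tag names `<hist>`, `<traj>`, `<obs>` and the surrounding wording are invented for illustration and are not the paper's actual prompt format.

```python
def build_prompt(instruction, num_history_slots, num_traj_tokens):
    """Toy prompt template in the spirit of PGM: explicit tags mark
    where historical-slot and trajectory embeddings would be spliced
    into the multimodal token stream of the LLM backbone."""
    history = " ".join(["<hist>"] * num_history_slots)
    trajectory = " ".join(["<traj>"] * num_traj_tokens)
    return (
        "You are a UAV navigation agent.\n"
        f"Instruction: {instruction}\n"
        f"Visual history: {history}\n"
        f"Trajectory so far: {trajectory}\n"
        "Current observation: <obs>\n"
        "Predict the next waypoint."
    )
```

Explicit tagging of each modality is what lets the backbone attend to history and motion as distinct, addressable spans, as opposed to an undifferentiated concatenated feature vector.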

Experimental Evaluation and Ablation

Benchmarks and Setup

LongFly is evaluated on the OpenUAV dataset, leveraging the AirSim simulator for diverse, realistic urban and natural environments with long trajectories and highly varied object categories. The principal metrics are navigation error (NE), success rate (SR), oracle success rate (OSR), and success weighted by path length (SPL), following established VLN protocols.
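For reference, SPL is the standard VLN metric that weights each successful episode by how efficient the taken path was relative to the shortest path; a minimal implementation:

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i),
    where S_i is the binary success indicator, l_i the shortest-path
    length, and p_i the length of the path the agent actually took."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

For example, an agent that succeeds but takes a path twice the shortest length scores 0.5 on that episode, so SPL penalizes inefficient detours that a raw success rate would ignore.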

Quantitative Results

LongFly demonstrates strong numerical improvements compared to state-of-the-art baselines, notably TravelUAV, NavFoM, and CMA. On the full unseen test set, LongFly achieves a +7.89% absolute increase in SR and +6.33% in SPL, with NE dropping from 118.34 to 91.84 meters over the closest baseline. The most pronounced gains are observed on Hard splits with complex layouts and increased instruction ambiguity. Even in entirely unseen maps and with novel object categories, LongFly preserves substantial margins in SR, OSR, and SPL over all competitors. Figure 7

Figure 7: Comparative evaluation across unseen environment splits, highlighting robust gains in SR, OSR, SPL, and reduced NE for LongFly over strong baselines.

Ablation and Architectural Sensitivity

Progressive ablation demonstrates the incremental and complementary value of SHIC and STE, with ablation of either module resulting in monotonic degradation of all performance metrics. Prompt-based fusion outperforms plain concatenation by a large margin in both SR and SPL, confirming the necessity of explicit multimodal alignment. Model robustness is further validated through hyperparameter sweeps over learning rate, slot count, and history length, with significant sensitivity to historical context span (longer visual/motion histories yield larger improvements).

Qualitative Assessment

Qualitative trajectory comparisons indicate that baseline models exhibit myopic behavior, drifting under viewpoint shifts or complex scene layouts. In contrast, LongFly preserves global semantic consistency, navigating reliably using aligned context from historical images and motion, especially around key landmarks. Figure 6

Figure 6: Qualitative rollout comparison: LongFly maintains global trajectory alignment and consistent landmark grounding, outperforming a Qwen-based baseline with no context modeling.

Implications and Future Prospects

This work establishes that unified, prompt-structured spatiotemporal context modeling is a prerequisite for reliable long-horizon UAV VLN – especially under conditions of environmental and instruction complexity. The integration strategy – combining recurrent slot-based vision memory, temporal motion encoding, and instruction-grounded multimodal prompts – is versatile and compatible with LLM architectures, enabling out-of-distribution generalization and efficient scaling.

Practically, LongFly’s advancements are directly applicable to autonomous aerial search, persistent surveillance, and infrastructure inspection scenarios where navigational instruction, dynamic viewpoint, and spatial drift are critical bottlenecks. Theoretically, the demonstrated efficacy of cross-modal spatiotemporal unification suggests a template for more general long-term instruction-following in embodied AI, including ground robotics or mixed-reality agents.

Potential future trajectories include: (1) transfer to real-world UAV platforms with noisy and partial observations; (2) extension to multi-agent and collaborative tasks with distributed spatiotemporal memory; (3) enhancement of environmental diversity during training to further ameliorate generalization gaps on novel maps and structural priors; and (4) instantiation atop more expressive multimodal foundation models for richer semantic grounding.

Conclusion

LongFly systematically addresses the limitations of prior UAV VLN systems by integrating structured, compressive, and instruction-aligned history modeling for long-horizon navigation. Its modular approach improves both navigation reliability and semantic grounding, achieving state-of-the-art results across all evaluated metrics and splits. The framework provides a principled path toward more generalizable, robust, and instruction-following embodied AI in high-dimensional and geometrically complex environments.

