CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model

Published 14 Aug 2025 in cs.RO, cs.AI, cs.CL, and cs.CV | (2508.10416v1)

Abstract: Existing vision-and-language navigation models often deviate from the correct trajectory when executing instructions. However, these models lack effective error correction capability, hindering their recovery from errors. To address this challenge, we propose Self-correction Flywheel, a novel post-training paradigm. Instead of considering the model's error trajectories on the training set as a drawback, our paradigm emphasizes their significance as a valuable data source. We have developed a method to identify deviations in these error trajectories and devised innovative techniques to automatically generate self-correction data for perception and action. These self-correction data serve as fuel to power the model's continued training. The brilliance of our paradigm is revealed when we re-evaluate the model on the training set, uncovering new error trajectories. At this time, the self-correction flywheel begins to spin. Through multiple flywheel iterations, we progressively enhance our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2% and 16.4%. Real robot tests in various indoor and outdoor environments demonstrate \method's superior capability of error correction, dynamic obstacle avoidance, and long instruction following.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a self-correction flywheel that iteratively improves VLN model performance by detecting deviations and generating correction data.
The methodology employs deviation detection and domain randomization, achieving success rates of 65.1% and 69.3% on R2R-CE and RxR-CE benchmarks respectively.
The results highlight robust navigation improvements in both simulated and real-world deployments, paving the way for advanced embodied intelligence.

Introduction

The paper "CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model" (2508.10416) presents an innovative approach to improving Vision-Language Navigation (VLN) models by integrating self-correction capabilities. Traditional VLN models focus on enhancing visual perception and multimodal reasoning to navigate environments based on natural language instructions. However, the absence of effective error correction mechanisms often results in significant deviations from intended paths. CorrectNav addresses this limitation through a novel post-training paradigm referred to as the Self-correction Flywheel, enabling models to automatically generate self-correction data and iteratively enhance their performance.

Self-correction Flywheel Paradigm

CorrectNav's Self-correction Flywheel is conceptualized as a continuous loop comprising model evaluation, deviation detection, self-correction data generation, and continued training.

Figure 1: The overview of CorrectNav training. CorrectNav is first finetuned on the navigation tasks (Left), including action prediction and instruction generation. To enhance vision diversity, domain randomization strategies are implemented. Subsequently, the model undergoes Self-correction Flywheel post-training (Right), engaging in continuous evaluation and training cycles to improve error recovery.

Detailed Process

Model Evaluation: The trained model is reevaluated on its training set to identify error trajectories. This step leverages incorrect paths as opportunities for model refinement rather than shortcomings.
Deviation Detection: CorrectNav employs a method to detect deviations by measuring distances between error trajectories and true paths using interpolated reference points.
Self-Correction Data Creation: The paradigm creates valuable correction data using identified deviations. For action correction, trajectories illustrating successful recovery are collected, while multimodal models analyze keyframes for perception correction.
Continued Training: The model is refined using a blend of original training data and newly generated correction data, leading to progressive performance enhancements through multiple flywheel iterations.

Numerical Results and Comparative Analysis

CorrectNav achieves a state-of-the-art success rate of 65.1% on the R2R-CE and 69.3% on the RxR-CE benchmarks, significantly outperforming previous VLN models by margins of 8.2% and 16.4%, respectively.

Figure 2: Case studies comparing CorrectNav with and without Self-correction Flywheel post-training illustrating enhanced path correction capabilities.

Effectiveness of Self-Correction Iterations

The paper highlights the quantitative improvements offered by CorrectNav through iterative flywheel training, showcasing increased success rates and reduced navigation errors across successive iterations.

Figure 3: CorrectNav's performance on R2R-CE and RxR-CE Val-Unseen splits over Self-correction Flywheel iterations.

Real-world Deployment

CorrectNav's capabilities extend beyond simulated environments, showing superior performance in real-world scenarios, including complex indoor and outdoor navigation tasks. The model's dynamic obstacle avoidance and long-horizon instruction following are effectively demonstrated in robotic tests.

Figure 4: Qualitative results from the real-world deployment of CorrectNav, exhibiting robust navigation and error recovery in natural settings.

Implications and Future Directions

The introduction of self-correcting mechanisms in VLN models provides a substantial enhancement in robustness and efficiency, particularly in deployment scenarios requiring fast inference. CorrectNav serves as a promising step toward embodied intelligence with improved recovery capabilities. Future research aims to further enrich navigation models by integrating robotic dimension and state information as prior knowledge, potentially reducing navigation discrepancies attributed to monocular RGB sensing limitations.

Conclusion

CorrectNav embodies a significant advancement in VLN by embedding error correction into the core of model operation. Its Self-correction Flywheel not only fosters performance improvements but also equips models with the ability to autonomously adapt to unforeseen challenges in the navigation task. These attributes lay the groundwork for future exploration and development in vision-language-action models and their applications in real-world environments.

Markdown Report Issue