Vision-and-Language Navigation via Causal Learning
Abstract: In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
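The back-door adjustment underlying the proposed BACL module follows Pearl's classic identity P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(z), which removes the spurious correlation induced by an observed confounder Z. A minimal sketch on a toy discrete model (the probabilities below are illustrative, not from the paper):

```python
import numpy as np

# Toy causal model: confounder Z influences both the input X and the outcome Y.
# Back-door adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(z)

p_z = np.array([0.7, 0.3])                # P(Z): prior over the confounder
p_y1_given_xz = np.array([[0.2, 0.8],     # P(Y=1 | X=0, Z=z)
                          [0.5, 0.9]])    # P(Y=1 | X=1, Z=z)

def backdoor_p_y1_do_x(x: int) -> float:
    """P(Y=1 | do(X=x)) via back-door adjustment over the observed confounder Z."""
    return float(np.dot(p_y1_given_xz[x], p_z))

print(backdoor_p_y1_do_x(0))  # 0.2*0.7 + 0.8*0.3 = 0.38
print(backdoor_p_y1_do_x(1))  # 0.5*0.7 + 0.9*0.3 = 0.62
```

In the paper's setting, Z is replaced by learned dictionaries of confounder features over vision, language, and history, and the sum becomes an attention-weighted expectation; the toy model only conveys the form of the intervention.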