Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Abstract: A central challenge in developing robots that relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, which hinders their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner, without any human intervention. NILS combines pretrained vision-language foundation models to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabeled interaction data, and ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS can autonomously annotate diverse robot demonstrations from unlabeled, unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity. We use NILS to label over 115k trajectories obtained from over 430 hours of robot data. We open-source our auto-labeling code and generated annotations on our website: http://robottasklabeling.github.io.
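The abstract's labeling recipe — detect objects, compare their states across a segment, and turn the detected change into a language instruction — can be illustrated with a minimal sketch. Everything here is hypothetical and stands in for the real system: the `Frame` dataclass represents per-frame detections that an open-vocabulary detector would produce, `detect_changes` is a toy object-centric change detector on 2D positions, and the template in `label_segment` stands in for the language model that would phrase the final instruction.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    # Hypothetical stand-in for open-vocabulary detector output:
    # object name -> (x, y) position in normalized image coordinates.
    objects: dict

def detect_changes(start: Frame, end: Frame, eps: float = 0.05) -> list:
    """Return names of objects whose position changed between two frames."""
    moved = []
    for name, (x0, y0) in start.objects.items():
        end_pos = end.objects.get(name)
        if end_pos is None:
            continue  # object left the scene; ignored in this sketch
        x1, y1 = end_pos
        if abs(x1 - x0) > eps or abs(y1 - y0) > eps:
            moved.append(name)
    return moved

def label_segment(frames: list) -> str:
    """Label a task segment from its first and last frame."""
    moved = detect_changes(frames[0], frames[-1])
    if not moved:
        return "no-op"
    # The real system would ask an LLM to phrase the instruction;
    # a fixed template stands in here.
    return f"move the {moved[0]}"
```

For example, a segment whose first frame places a carrot at (0.10, 0.20) and whose last frame places it at (0.40, 0.20), with a pot unmoved, would be labeled "move the carrot". The actual pipeline additionally segments long-horizon data into such task windows before labeling, which this sketch takes as given.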
- Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
- Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
- Droid: A large-scale in-the-wild robot manipulation dataset. CoRR, 2024.
- Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
- RT-1: robotics transformer for real-world control at scale. In K. E. Bekris, K. Hauser, S. L. Herbert, and J. Yu, editors, Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. doi:10.15607/RSS.2023.XIX.025.
- Scaling Robot Learning with Semantically Imagined Experience, Feb. 2023.
- Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- T. Lüddecke and A. Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022.
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory. arXiv preprint arXiv:2210.05663, 2022.
- Open-vocabulary queryable scene representations for real world planning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11509–11522. IEEE, 2023.
- Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. In Robotics: Science and Systems, 2024.
- Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
- Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.
- Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Audio visual language maps for robot navigation. In Proceedings of the International Symposium on Experimental Robotics (ISER), Chiang Mai, Thailand, 2023.
- Vision-language models provide promptable representations for reinforcement learning. arXiv preprint arXiv:2402.02651, 2024.
- RobotVQA: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024.
- Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning, 2024.
- Lelan: Learning a language-conditioned navigation policy from in-the-wild video. In Conference on Robot Learning, 2024.
- Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024.
- Universal visual decomposer: Long-horizon manipulation made easy. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6973–6980. IEEE, 2024.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023.
- A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96. AAAI Press, 1996.
- Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1316–1326, 2023.
- Robots that ask for help: Uncertainty alignment for large language model planners. In Conference on Robot Learning (CoRL), 2023.
- Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8121–8130, 2022.
- REFLECT: Summarizing robot experiences for failure explanation and correction. In Conference on Robot Learning, pages 3468–3484. PMLR, 2023.
- Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
- G. Team. Gemini: A family of highly capable multimodal models, 2023.
- X-Clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 638–647, 2022.
- Video-llava: Learning united visual representation by alignment before projection, 2023.
- Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences, 2024.
- "Task Success" is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors, 2024.
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models, Feb. 2024.
- Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023.
- Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024.
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, Nov. 2023.
- Affordance learning from play for sample-efficient policy learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 6372–6378. IEEE, 2022.
- Grounding language with visual affordances over unstructured data, 2023.
- S. James and A. J. Davison. Q-attention: Enabling efficient learning for vision-based robotic manipulation. IEEE Robotics and Automation Letters, 7(2):1612–1619, 2022.
- Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
- Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023.
- VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners, Mar. 2023.
- Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497, 2022.
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, June 2021.
- Large Language Models are Temporal and Causal Reasoners for Video Question Answering, Nov. 2023.
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models, July 2023.
- Goal representations for instruction following: A semi-supervised language interface to control. In Conference on Robot Learning, pages 3894–3908. PMLR, 2023.
- Policy adaptation from foundation model feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19059–19069, 2023.
- Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, pages 23301–23320. PMLR, 2023.
- Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European Conference on Computer Vision, pages 640–658. Springer, 2022.
- SPRINT: Scalable policy pre-training via language instruction relabeling. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9168–9175. IEEE, 2024.
- EfficientSAM: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16111–16121, 2024.
- Vip: Towards universal visual reward and representation via value-implicit pre-training, 2023.
- Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- Goal conditioned imitation learning using score-based diffusion policies. In Robotics: Science and Systems, 2023.
- Physically grounded vision-language models for robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12462–12469. IEEE, 2024.
- Learning latent plans from play. In Conference on robot learning, pages 1113–1132. PMLR, 2020.
- Latent plans for task-agnostic offline reinforcement learning. In Conference on Robot Learning, pages 1838–1849. PMLR, 2023.
- C. Lynch and P. Sermanet. Language Conditioned Imitation Learning over Unstructured Data, July 2021.
- What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022.
- Roboclip: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36, 2024.