Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Published 23 Oct 2024 in cs.RO, cs.AI, cs.CV, and cs.LG | arXiv:2410.17772v2

Abstract: A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pretrained vision-language foundation models in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS can autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity. We use NILS to label over 115k trajectories obtained from over 430 hours of robot data. We open-source our auto-labeling code and generated annotations on our website: http://robottasklabeling.github.io.

Summary

  • The paper introduces NILS, a zero-shot labeling framework that scales robot policy learning by automatically annotating over 115,000 trajectories without human intervention.
  • It employs a three-stage pipeline of pretrained foundation models that detects scene objects, annotates object-centric changes, and identifies keystates to generate free-form language labels.
  • Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS outperforms state-of-the-art video-language models in keystate identification and annotation detail.

Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models: An Expert Overview

The paper "Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models" introduces a novel framework named Natural Language Instruction Labeling for Scalability (NILS) for labeling robot demonstrations. The primary objective of NILS is to address the limitations posed by the scarcity of natural language annotations in existing robot datasets, which is essential for developing robots that can easily relate human language to their perception and actions. Typically, language-conditioned robot policies are trained using templated annotations or expensive human-generated instructions, limiting their scalability. NILS offers a zero-shot labeling solution that operates without human intervention, enabling it to efficiently annotate large-scale, uncurated, long-horizon robot datasets.

At its core, NILS combines pretrained vision-language foundation models to identify objects, detect object-centric changes, segment tasks, and label behavior datasets autonomously. Evaluations on BridgeV2, Fractal, and a kitchen play dataset demonstrate that it can annotate large volumes of unlabeled robot data, totaling over 115,000 trajectories obtained from more than 430 hours of recordings.
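To ground the "identify objects" step, the snippet below shows what zero-shot, prompt-based object detection on a single robot frame can look like with an off-the-shelf open-vocabulary detector from Hugging Face transformers. It is only an illustration: the OWL-ViT checkpoint, text prompts, frame path, and confidence threshold are assumptions, not necessarily the models or settings used by NILS.

```python
# Illustrative only: zero-shot, open-vocabulary object detection on one robot
# frame. Checkpoint, prompts, and threshold are stand-in assumptions.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("frame_0000.png")  # hypothetical frame from a robot trajectory
prompts = [["a carrot", "a metal pot", "a robot gripper", "an oven door"]]

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to pixel coordinates and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.2
)[0]

for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(prompts[0][int(label)], [round(v) for v in box.tolist()], f"{score.item():.2f}")
```

Queries like these produce named bounding boxes for every scene object, which downstream stages can then track and compare across frames.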

A significant contribution of NILS is its ability to provide high-quality annotations while mitigating issues associated with crowdsourced human labels, such as inconsistencies and limited diversity. The paper shows that NILS surpasses state-of-the-art video-language models in identifying keystates and annotates tasks in greater detail.

Methodology

The NILS framework is divided into three stages (a schematic code sketch follows the list):

  1. Stage 1 identifies the objects in a scene, combining multiple vision-language models to detect and consistently name objects despite occlusions.
  2. Stage 2 performs object-centric scene annotation, monitoring four key signals: object relations and movement, object state changes, gripper position, and gripper closing actions.
  3. Stage 3 detects keystates and generates language labels, using a heuristic consensus approach to identify important keystates and a large language model to produce free-form instructions.
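To make the flow across these stages concrete, here is a minimal sketch of how such a labeling pipeline could be organized. It is a hypothetical reconstruction from the description above, not the authors' released code: the `detector`, `vlm`, and `llm` interfaces, the tracked signals, and the voting threshold are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class FrameAnnotation:
    """Object-centric signals extracted from one frame (Stage 2)."""
    relations: frozenset      # e.g. {("carrot", "inside", "pot")}
    object_states: frozenset  # e.g. {("oven door", "open")}
    gripper_pos: tuple        # (x, y) gripper location in the image
    gripper_closed: bool      # whether the gripper is closed

def label_trajectory(frames, detector, vlm, llm, vote_threshold=2):
    """Hypothetical NILS-style labeling of one long-horizon trajectory."""
    # Stage 1: detect and consistently name the objects present in the scene.
    scene_objects = detector.detect_objects(frames[0])

    # Stage 2: object-centric annotation of every frame.
    per_frame = []
    for frame in frames:
        boxes = detector.track(frame, scene_objects)
        per_frame.append(FrameAnnotation(
            relations=vlm.object_relations(frame, boxes),
            object_states=vlm.object_states(frame, boxes),
            gripper_pos=detector.gripper_position(frame),
            gripper_closed=detector.gripper_closed(frame),
        ))

    # Stage 3a: keystate detection by consensus of simple change heuristics.
    keystates = []
    for t in range(1, len(per_frame)):
        prev, curr = per_frame[t - 1], per_frame[t]
        votes = sum([
            prev.relations != curr.relations,            # an object moved somewhere new
            prev.object_states != curr.object_states,    # e.g. a door opened or closed
            prev.gripper_closed != curr.gripper_closed,  # grasp or release event
        ])
        if votes >= vote_threshold:
            keystates.append(t)

    # Stage 3b: ask a language model to phrase each segment as an instruction.
    labels, start = [], 0
    for end in keystates:
        labels.append(llm.describe_segment(per_frame[start:end + 1]))
        start = end
    return list(zip(keystates, labels))
```

In the actual system, each of these signals comes from a different pretrained specialist model; the sketch simply abstracts those models behind the `detector`, `vlm`, and `llm` interfaces.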

Implications and Future Directions

The implications of NILS span both practical and theoretical domains. Practically, the ability to automatically label large datasets significantly reduces the cost and labor of training language-conditioned robot policies. Theoretically, it strengthens the case for scalable robot learning by demonstrating that foundation models can generate detailed, contextually relevant annotations.

The results indicate a promising direction for future developments in artificial intelligence, particularly in robotics. The framework's use of multiple specialist models highlights the potential of modular approaches to improve robotic perception and interaction, and the insights from this work may stimulate further research into adapting foundation models to more specialized tasks and environments.

Overall, NILS exemplifies a forward-looking approach to the challenges of robot policy learning, laying groundwork for more intuitive and capable robotic systems that can better navigate and interact with the complexities of the real world. As foundation models continue to evolve, frameworks like NILS may catalyze further progress in scalable, autonomous machine learning systems.
