A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
Abstract: A command-following robot that serves people in everyday life must continually improve itself in its deployment domains with minimal help from its end users rather than from engineers. Previous methods are either difficult to improve further after deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, without any hand-crafted reward functions. We demonstrate our approach on a variety of sound types and robotic tasks, including navigation and manipulation from raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios with fewer newly labeled data, while still outperforming previous methods.
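To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea the abstract describes: embedding images and sound commands into a shared space with an InfoNCE-style contrastive loss, then using the similarity between the current observation and the commanded sound as an intrinsic reward. The encoder architectures, layer sizes, and the exact reward form here are illustrative assumptions, not the paper's reported design.

```python
# Hypothetical sketch: contrastive visual-audio representation whose
# embedding similarity serves as an intrinsic reward. Sizes and modules
# are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAudioEncoder(nn.Module):
    """Two small encoders projecting RGB images and audio spectrograms
    (e.g., MFCC features) into a shared, L2-normalized embedding space."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.audio_net = nn.Sequential(  # spectrogram as a 1-channel image
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, image, audio):
        z_img = F.normalize(self.image_net(image), dim=-1)
        z_aud = F.normalize(self.audio_net(audio), dim=-1)
        return z_img, z_aud

def contrastive_loss(z_img, z_aud, temperature: float = 0.1):
    """InfoNCE-style objective: matching image/sound pairs in the batch
    are positives; all other cross pairings act as negatives."""
    logits = z_img @ z_aud.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def intrinsic_reward(encoder, image, audio):
    """Reward the policy when the current observation's embedding moves
    close to the commanded sound's embedding (cosine similarity in [-1, 1])."""
    with torch.no_grad():
        z_img, z_aud = encoder(image, audio)
        return (z_img * z_aud).sum(dim=-1)
```

Under this reading, fine-tuning in a new domain only requires a non-expert to supply matched image–sound pairs: the same contrastive loss updates the encoders, and the reward needs no manual shaping.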