DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory, whereas MLLM inference requires storing billions of parameters and performing substantial computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriately sized portion of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
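The core idea of a multi-exit model with a termination criterion can be sketched in a few lines. The following is a minimal toy illustration, not DeeR's actual implementation: `make_block`, `multi_exit_forward`, and the agreement-based stopping rule are all hypothetical stand-ins (DeeR learns its termination criteria from predefined resource budgets), but the control flow of skipping deeper layers once an exit suffices is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim):
    """A hypothetical stand-in for one transformer block of the MLLM."""
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda x: np.tanh(x @ w)

def multi_exit_forward(x, blocks, heads, threshold):
    """Run blocks sequentially; stop once consecutive exits agree.

    Returns (action, exits_used). Only the early-exit control flow mirrors
    DeeR; the agreement threshold here is an illustrative criterion, not
    the paper's learned, budget-conditioned one.
    """
    prev = None
    for i, (block, head) in enumerate(zip(blocks, heads), start=1):
        x = block(x)                  # activate one more block of the LLM
        action = head(x)              # intermediate exit head prediction
        if prev is not None and np.linalg.norm(action - prev) < threshold:
            return action, i          # early exit: deeper blocks are skipped
        prev = action
    return action, len(blocks)        # full model was needed

dim, n_exits = 16, 6
blocks = [make_block(dim) for _ in range(n_exits)]
heads = [lambda h: h[:4] for _ in range(n_exits)]   # shared toy action head
obs = rng.standard_normal(dim)                      # a fake observation token
action, used = multi_exit_forward(obs, blocks, heads, threshold=0.5)
print(f"{used} of {n_exits} blocks activated")
```

Because "easy" situations satisfy the exit criterion after only a few blocks, average compute drops while the full model remains available for hard cases; tightening or loosening the criterion is what lets the framework target a given average-cost, latency, or memory budget.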