
A Survey on Efficient Vision-Language-Action Models

Published 27 Oct 2025 in cs.CV, cs.AI, cs.LG, and cs.RO | (2510.24795v1)

Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/

Summary

  • The paper introduces a detailed taxonomy categorizing efficient model design, training, and data collection strategies for vision-language-action systems.
  • The paper reviews innovative architectures such as linear attention, alongside compression techniques like pruning and quantization that reduce computational demands.
  • The paper highlights practical applications in autonomous vehicles, smart homes, and robotics, while outlining challenges and future research directions for embodied intelligence.


The paper "A Survey on Efficient Vision-Language-Action Models" (2510.24795) offers a comprehensive analysis of Vision-Language-Action (VLA) models, focusing on improving their efficiency across model design, training strategies, and data collection. VLAs represent a significant step in embodied intelligence, linking digital reasoning to physical action, but they are typically computationally demanding, which complicates deployment in resource-constrained environments. The survey systematically organizes current approaches into three pillars to address these challenges (Figure 1).

Figure 1: Necessity of Efficient VLAs.

Efficient Model Design

The core of efficient VLA model design lies in creating architectures that minimize computational costs without sacrificing performance. This involves:

  • Efficient Architectures: Innovations here include linear attention mechanisms, transformer alternatives such as Mamba for linear-time sequence modeling, and efficient decoding strategies such as parallel and generative action decoding. These methods aim to reduce latency and computational load while maintaining accuracy in action generation (Figure 2).

    Figure 2: Key strategies for Efficient Architectures in VLAs.

  • Model Compression: Techniques such as layer pruning, quantization, and token optimization strip away redundant parameters and compress intermediate representations, reducing model size and improving inference speed while aiming to retain essential capabilities (Figure 3).

    Figure 3: Key strategies for Model Compression in VLAs.
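To make the linear attention idea concrete: quadratic softmax attention can be replaced by a kernel feature map so that the key-value summary is computed once in O(n·d²) rather than O(n²·d). The sketch below is a generic formulation (not any specific model from the survey), using the common elu(x)+1 feature map:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linearized attention: associate (phi(K)^T V) first, avoiding the n x n matrix."""
    Qf, Kf = phi(Q), phi(K)        # (n, d) feature-mapped queries/keys
    kv = Kf.T @ V                  # (d, d) summary of all keys and values
    z = Qf @ Kf.sum(axis=0)        # (n,) per-query normalizer
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Because the (d, d) key-value summary is independent of sequence length, cost grows linearly in n, which is the property that makes such mechanisms attractive for low-latency action generation.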

Efficient Training

Training VLAs efficiently involves strategies to reduce both data requirements and computational overheads:

  • Efficient Pre-Training: Strategies include data-efficient learning that leverages unlabeled data through self-supervised objectives, mixed-data co-training, and compact action representations that yield effective VLA policies.
  • Efficient Post-Training: Focuses on adapting models to specific tasks with minimal resources, for example through parameter-efficient tuning and reinforcement learning for policy optimization. These approaches let VLAs adapt quickly to new environments or tasks without extensive retraining (Figure 4).

    Figure 4: Key strategies for Efficient Training in VLAs.
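Parameter-efficient tuning can be illustrated with a low-rank adapter in the style of LoRA: the pretrained weight is frozen and only a small rank-r update is trained. The sketch below is a minimal illustration (names, shapes, and initializations are chosen for the example, not taken from the survey):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d)."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                              # frozen (out, in)
        self.A = rng.normal(scale=0.01, size=(r, W.shape[1]))   # trainable down-projection
        self.B = np.zeros((W.shape[0], r))                      # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Base output plus scaled low-rank correction
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).normal(size=(256, 256))
layer = LoRALinear(W, r=4)
x = np.ones((1, 256))
print(layer.trainable_params(), W.size)  # 2048 65536, i.e. ~3% of full fine-tuning
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, and only the small A and B matrices need gradients, storage, and optimizer state.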

Efficient Data Collection

Data collection is a significant bottleneck for VLAs, given the need for vast and diverse datasets. The survey categorizes efficient data collection approaches into:

  • Human-in-the-Loop and Simulation Data: Utilizing human inputs more strategically and expanding the use of simulations to generate diverse and scalable datasets.
  • Cross-Domain Utilization: Integrating Internet-scale data and cross-domain datasets to enhance the breadth and generalization of VLAs.
  • Self-Exploration and Augmentation: Methods that autonomously generate or enrich datasets, improving the model's coverage and performance without extensive manual data collection (Figure 5).

    Figure 5: Taxonomy of Efficient Data Collection Strategies in VLAs.
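One simple form of the augmentation idea is to expand a small set of demonstrations by jittering recorded actions with small Gaussian noise. The sketch below is a generic illustration under assumed data shapes (a list of observation/action trajectory pairs), not a method from the survey:

```python
import numpy as np

def augment_trajectories(trajs, n_copies=3, noise_std=0.01, seed=0):
    """Expand a demo set by adding Gaussian action noise.
    trajs: list of (obs, actions) pairs, actions shaped (T, action_dim)."""
    rng = np.random.default_rng(seed)
    out = list(trajs)  # keep the originals
    for obs, acts in trajs:
        for _ in range(n_copies):
            noisy = acts + rng.normal(scale=noise_std, size=acts.shape)
            out.append((obs, noisy))
    return out

demos = [(np.zeros((50, 8)), np.zeros((50, 7)))]  # one 50-step, 7-DoF demonstration
aug = augment_trajectories(demos)
print(len(aug))  # 4: the original plus 3 jittered copies
```

In practice the noise scale must respect the robot's control tolerances; more sophisticated schemes resample or relabel trajectories, but the principle of multiplying data without new teleoperation time is the same.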

Applications

Efficient VLA models have a wide range of applications:

  • Intelligent Vehicles: VLAs streamline processing in autonomous systems, enabling real-time decision-making essential for navigation and safety.
  • Smart Homes and Robotics: These models enhance interaction quality by providing robust command execution in domestic robots, prioritizing privacy and real-time processing on edge devices.
  • Industrial and Medical Robotics: In industrial settings, VLAs facilitate efficiency and adaptability across tasks, while in medical support, they ensure precision and security in sensitive environments.

Challenges and Future Directions

Despite advancements, VLAs face challenges such as balancing model compactness with expressivity, ensuring scalable and stable training, and overcoming data barriers. Future work should focus on adaptive architectures, resilient training paradigms, and sustainable data ecosystems that synergize to form a cohesive, efficient embodied intelligence framework.

Conclusion

The survey highlights the critical components required to advance the efficiency of VLA models, focusing on model design, training, and data collection. By systematically addressing these aspects, the paper sets the groundwork for future explorations that could enable widespread deployment of efficient, capable VLA systems across various domains.
