Practical distillation methods for large-scale Vision-Language-Action models

Develop practical distillation methods for large-scale Vision-Language-Action models that enable real-time robotic control during deployment on physical platforms.

Background

The paper highlights a growing tension between the increasing size of Vision-Language-Action (VLA) models and the strict real-time requirements of robotic control. Although several distillation techniques have been explored to reduce inference latency, these approaches are not yet sufficient to make large VLA models practically usable in real-time robotic settings.

Addressing this gap requires methods that can compress or otherwise optimize large VLA models so that they remain effective while satisfying the control-frequency and latency constraints necessary for real-world manipulation tasks.
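One common compression route the cited work alludes to is response-based knowledge distillation, where a small student policy is trained to match the teacher's output distribution. As a minimal sketch (all function names and the discretized-action-bin setup are illustrative assumptions, not details from the paper), a temperature-scaled KL objective over a VLA policy head might look like:

```python
import math

# Illustrative sketch of response-based knowledge distillation for a
# VLA policy head over discretized action bins. All names here are
# hypothetical; the paper does not specify a particular loss.

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened action distributions.

    A higher temperature softens both distributions so the student also
    learns the teacher's relative preferences among non-argmax bins.
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

In practice the student would be a much smaller network meeting the control-frequency budget, and this loss would typically be combined with a supervised action loss on demonstration data; continuous-action VLA heads would need a regression or distribution-matching variant instead of the discrete KL shown here.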

References

While some distillation techniques have been explored (Chen et al., 2023; Wang et al., 2024b; Prasad et al., 2024), the development of practical methods for large-scale VLA models remains an open problem.

RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization (2602.03310 - Liu et al., 3 Feb 2026) in Section 1, Introduction