What makes for good action tokenizers for VLA optimization

Determine the specific design principles and criteria that make discrete action tokenizers effective for optimizing Vision-Language-Action (VLA) models, i.e., Vision-Language Models fine-tuned under a native autoregressive paradigm. The criteria should extend explicitly beyond reconstruction fidelity metrics to account for training efficiency and downstream performance.

Background

Vision-Language-Action models increasingly rely on discretizing continuous robot actions into tokens so that pre-trained Vision-Language Models can be fine-tuned with the standard autoregressive objective. Despite this widespread use, prior work has predominantly evaluated action tokenizers by reconstruction fidelity, leaving their impact on VLA training dynamics underexplored.
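To make the setup concrete, here is a minimal sketch of one common discretization scheme: uniform per-dimension binning, with action tokens appended to the language model's vocabulary. The bin count, action bounds, and vocabulary offset are illustrative assumptions, not the paper's (or any specific model's) design.

```python
import numpy as np

N_BINS = 256          # assumed: 256 bins per action dimension
VOCAB_OFFSET = 32000  # assumed: start of action-token IDs in the VLM vocabulary

def tokenize_action(action, low=-1.0, high=1.0):
    """Map each continuous action dimension to one discrete token ID."""
    action = np.clip(action, low, high)
    bins = np.clip(np.floor((action - low) / (high - low) * N_BINS), 0, N_BINS - 1)
    return bins.astype(int) + VOCAB_OFFSET

def detokenize_action(tokens, low=-1.0, high=1.0):
    """Invert the binning: token IDs back to bin-center continuous values."""
    bins = np.asarray(tokens) - VOCAB_OFFSET
    return low + (bins + 0.5) / N_BINS * (high - low)

# A 7-DoF action (e.g., 6D end-effector delta plus gripper) becomes 7 tokens
# that the VLM predicts autoregressively under the usual cross-entropy loss.
a = np.array([0.1, -0.3, 0.05, 0.0, 0.2, -0.1, 1.0])
ids = tokenize_action(a)
print(ids)
print(detokenize_action(ids))  # close to `a`, up to quantization error
```

The round trip is lossy, which is exactly why reconstruction fidelity became the default evaluation; the open question above asks what else matters.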

The paper targets this gap by analyzing tokenization through an information-theoretic lens (temporal overlap stability, vocabulary capacity, multimodal mutual information, and token independence) and proposes ActionCodec as an instantiation of these principles. The open question motivates establishing concrete, optimization-oriented design rules for action tokenizers that enable efficient, robust VLA training.
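For intuition about how such criteria could be measured, the sketch below computes simple empirical proxies for three of them on a token stream. These estimators are illustrative readings of the named criteria, not the paper's definitions; every function name and formula here is an assumption.

```python
import numpy as np
from collections import Counter

def codebook_entropy(token_ids, vocab_size):
    """Vocabulary-capacity proxy: entropy of empirical token usage in bits.
    Values near log2(vocab_size) indicate a well-utilized codebook."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def temporal_overlap_stability(tokenize, trajectory, window, stride=1):
    """Temporal-stability proxy: fraction of tokens for shared timesteps that
    agree when the tokenization window slides by `stride` steps."""
    a = np.asarray(tokenize(trajectory[:window]))                 # window at t = 0
    b = np.asarray(tokenize(trajectory[stride:stride + window]))  # shifted window
    # Assumes one token per timestep, so the overlapping region aligns as:
    return float(np.mean(a[stride:] == b[:window - stride]))

def pairwise_mutual_information(x, y):
    """Token-independence proxy: empirical mutual information in bits between
    two token streams; values near zero suggest non-redundant tokens."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

# Toy usage on a synthetic token stream and a per-step stand-in tokenizer.
rng = np.random.default_rng(0)
toks = rng.integers(0, 16, size=10_000)
print(codebook_entropy(toks, 16))                                    # ~4 bits
print(pairwise_mutual_information(list(toks[:-1]), list(toks[1:])))  # ~0 for i.i.d.

coarse = lambda w: [int(v * 8) for v in w]  # hypothetical per-step tokenizer
print(temporal_overlap_stability(coarse, np.linspace(0.0, 1.0, 64), window=32))
```

A stateless per-step tokenizer scores 1.0 on the stability proxy; chunked or learned codecs can score lower, which is one way tokenizer design could affect training dynamics beyond reconstruction error.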

References

"Consequently, the fundamental question of what makes for good action tokenizers remains unanswered."

ActionCodec: What Makes for Good Action Tokenizers (2602.15397, Dong et al., 17 Feb 2026), Section 1 (Abstract), Page 1