VQ-BeT: Vector-Quantized Behavior Transformer
- VQ-BeT is a generative modeling framework that utilizes a hierarchical vector quantization bottleneck to represent complex, multimodal continuous behavior sequences.
- It overcomes limitations of traditional k-means discretization and iterative denoising by enabling end-to-end differentiable training and single-pass inference.
- Empirical results across manipulation, locomotion, and autonomous driving show notable performance and efficiency gains over baseline models.
Vector-Quantized Behavior Transformer (VQ-BeT) is a generative modeling framework designed for complex, multi-modal, and continuous behavior sequences in decision-making tasks. It addresses the limitations of prior approaches, such as Behavior Transformers (BeT) relying on k-means discretization and diffusion policies using iterative denoising, by introducing a hierarchical vector quantization bottleneck. This enables compact and expressive latent representations of actions that can be efficiently modeled and predicted by transformer architectures in both conditional and unconditional settings (Lee et al., 2024).
1. Background and Motivation
Modeling complex behavior sequences for imitation and offline policy learning requires capturing multimodal, high-dimensional, and continuous action distributions. Earlier approaches include:
- Behavior Transformers (BeT): Discretize actions using k-means clustering into bins, followed by categorical prediction and a small continuous offset. However, k-means is limited by its fixed binning, inability to scale in high-dimensional or long-horizon action spaces, and lack of end-to-end gradient flow.
- Diffusion Policies: Model a conditional denoising process across actions, capturing multi-modality via sequential refinement. These methods incur high computational cost and inference latency due to the necessity of denoising steps.
VQ-BeT replaces these discretization and modeling bottlenecks by introducing hierarchical (residual) vector quantization, enabling learned codebooks to capture both coarse and fine action modes. This approach provides end-to-end differentiability during quantizer training, robust multi-modality, and efficient inference through single-pass decoding (Lee et al., 2024).
2. Hierarchical Vector Quantization Module
VQ-BeT employs Residual Vector Quantization (RVQ) to tokenize continuous action chunks into a sequence of discrete latent codes suitable for transformer-based modeling.
- Encoder/Decoder Structure: An encoder maps action chunks to latent space; a decoder reconstructs actions from codebook outputs.
- Quantization: hierarchical quantization layers, each with its own codebook, enable residual decomposition: the first layer quantizes the encoder output $z$, and each subsequent layer quantizes the residual left by the previous one, i.e. $r_1 = z$ and $r_{i+1} = r_i - q_i(r_i)$.
The final quantized embedding sums the per-layer codes: $z_q = \sum_{i=1}^{N_q} q_i(r_i)$.
- Losses (standard VQ-VAE form):
- Reconstruction: $\mathcal{L}_{\text{recon}} = \lVert a - \hat{a} \rVert$, between the input action chunk $a$ and its decoded reconstruction $\hat{a}$,
- VQ loss: $\mathcal{L}_{\text{vq}} = \lVert \mathrm{sg}[z] - z_q \rVert_2^2$, which pulls codebook vectors toward encoder outputs ($\mathrm{sg}[\cdot]$ denotes stop-gradient),
- Commitment: $\mathcal{L}_{\text{commit}} = \lVert z - \mathrm{sg}[z_q] \rVert_2^2$, which keeps encoder outputs close to their assigned codes,
- Total RVQ-VAE loss: $\mathcal{L}_{\text{RVQ}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{vq}} + \beta \, \mathcal{L}_{\text{commit}}$.
Codebook vectors are updated using an exponential moving average schedule to maintain stability, following (van den Oord et al., 2017). This quantization infrastructure supports fine-grained capture of multi-modality through residual coding, with two-layer RVQ achieving 20–50% performance gains over vanilla VQ (Lee et al., 2024).
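The residual decomposition above can be sketched with a toy quantizer. This is illustrative only: the codebooks here are random and untrained, whereas VQ-BeT learns them jointly with an encoder/decoder and updates them via EMA; all names and dimensions below are made up for the example.

```python
# Toy residual vector quantization (RVQ): each layer quantizes the
# residual left by the previous layer, and the quantized embedding is
# the sum of the selected codebook vectors.
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(z, codebooks):
    """Quantize latent z through successive codebooks; return the
    per-layer code indices and the summed quantized embedding."""
    residual = z.copy()
    quantized = np.zeros_like(z)
    codes = []
    for C in codebooks:                          # one codebook per RVQ layer
        dist = ((residual[None, :] - C) ** 2).sum(axis=1)
        idx = int(np.argmin(dist))               # nearest code for this layer
        codes.append(idx)
        quantized += C[idx]
        residual = residual - C[idx]             # pass the residual down
    return codes, quantized

latent_dim, codebook_size, n_layers = 8, 16, 2
codebooks = [rng.normal(size=(codebook_size, latent_dim)) for _ in range(n_layers)]
z = rng.normal(size=latent_dim)
codes, z_q = rvq_encode(z, codebooks)
print(codes)                                     # primary and secondary code indices
print(np.linalg.norm(z - z_q))                   # remaining quantization error
```

The first index plays the role of the coarse "primary" code and the second refines it, which is the structure the transformer later predicts autoregressively.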
3. Transformer-Based Sequence Modeling
After RVQ-VAE pretraining, encoder, decoder, and codebooks are frozen. The core sequence model is an autoregressive, GPT-style transformer that predicts coded action sequences:
- Tokenization:
- Each action chunk yields one discrete code token per RVQ layer (a primary code plus residual codes).
- Observation and (optional) goal tokens are embedded via a learned linear layer or CNN (ResNet18 for images).
- Code tokens are embedded with a learned lookup table.
- An offset head predicts a small continuous correction that is added to the decoded action.
- Architecture:
- A stack of transformer blocks applies standard multi-head self-attention and MLP layers, with positional encoding.
- Output heads produce logits for each code level, and a separate head outputs the continuous offset.
- Objectives:
- Code prediction: focal loss for code classification, which down-weights codes the model already classifies confidently,
- Offset reconstruction: a regression loss between predicted and ground-truth continuous offsets,
- Total: a weighted sum of the code-prediction and offset losses, with code-prediction gradients backpropagated only through the transformer (RVQ-VAE parameters remain frozen).
- Training: Adam optimizer with warmup and cosine decay, training for $1000$–$2000$ epochs, and transformer dropout of $0.1$.
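The combined objective can be sketched as focal loss over code logits plus a regression term on the offset. The focusing parameter and loss weight below are illustrative defaults, not the paper's exact values, and the L1 offset term is an assumption of this sketch.

```python
# Sketch of VQ-BeT's training objective: focal loss on each code head
# plus a weighted continuous-offset regression term.
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one classification target: scales cross-entropy
    by (1 - p_t)^gamma, down-weighting easy examples."""
    z = logits - logits.max()                    # stable softmax
    p = np.exp(z) / np.exp(z).sum()
    p_t = p[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def vqbet_loss(code_logits, code_targets, offset_pred, offset_true, w_offset=1.0):
    """Sum focal losses over the code heads, add a weighted L1 offset term."""
    code_term = sum(focal_loss(l, t) for l, t in zip(code_logits, code_targets))
    offset_term = np.abs(offset_pred - offset_true).sum()
    return code_term + w_offset * offset_term

# One primary and one secondary code head, plus a 2-D offset.
logits = [np.array([2.0, 0.1, -1.0]), np.array([0.0, 1.5, 0.3])]
targets = [0, 1]
loss = vqbet_loss(logits, targets, np.zeros(2), np.array([0.01, -0.02]))
print(loss)
```

Because the focal term shrinks as $p_t \to 1$, confidently predicted codes contribute little gradient, which helps the rarer behavior modes stay represented.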
4. Multimodality, Conditioning, and Inference
VQ-BeT supports diverse forms of generative modeling:
- Multimodality:
- The hierarchical code structure allows categorical output over coarse modes, refined by residual codes.
- Diversity is further encouraged using top-$k$ or nucleus sampling during generation.
- Entropy analysis on task completion validates that VQ-BeT captures behavioral diversity comparably or better than diffusion models.
- Conditioning and Partial Observability:
- Goal tokens can be prepended for conditional behavior generation (e.g., trajectory or state targets).
- Classifier-free guidance is realized by stochastically omitting goal tokens during training, facilitating interpolation at inference.
- For partial observations (e.g., ego-state plus objects in self-driving), VQ-BeT incorporates only available features.
- Efficiency:
- Single-pass transformer decoding contrasts with the multi-iteration denoising of diffusion policies, yielding lower inference latency in simulated tasks ($15$ ms/timestep vs. $75$–$100$ ms) and on real robot CPUs ($5$ ms vs. $125$ ms).
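The sampling step over code logits is a standard top-$k$ draw; a minimal sketch, with made-up logits, is:

```python
# Minimal top-k sampling over code logits: keep the k highest-scoring
# codes, renormalize their probabilities, and sample one.
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample a code index from the k highest logits."""
    top = np.argsort(logits)[-k:]                # indices of the k best codes
    z = logits[top] - logits[top].max()          # stable softmax over the top-k
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.5, 0.1, -2.0])
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(set(samples))                              # only the two strongest codes appear
```

Sampling (rather than taking the argmax) is what lets a single trained model emit different behavior modes from the same observation.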
5. Empirical Evaluation and Performance
VQ-BeT was tested on multiple domains, including manipulation, locomotion, and autonomous driving, as well as both simulated and real-robot settings:
| Task Type | VQ-BeT Result | Best Comparator |
|---|---|---|
| PushT (state) | 0.78 (final IoU) | 0.73 (Diff-pol), 0.39 (BeT) |
| Ant multimodal (goals) | 3.22 | 3.12 (Diff-T), 2.73 (BeT) |
| Real-robot (1-phase) | 94% success | 90% (DiffPol-T) |
| Real-robot (2-phase) | 63% success | 37% (DiffPol-T) |
| Long-horizon (>3 subtasks) | higher success | - |
| nuScenes L₂ error (6 s) | 0.73 m | 0.74 m (GPT-Driver) |
VQ-BeT outperforms Behavior Cloning (BC), BeT, and diffusion policy baselines in the majority of tasks. The model achieves higher diversity (entropy of task completion order) without sacrificing accuracy. On the nuScenes driving dataset, it achieves trajectory prediction errors equivalent to leading autoregressive sequence models and exceeds diffusion-based control policies in both L₂ error and collision avoidance (Lee et al., 2024).
6. Architectural Ablations and Design Insights
Several design choices impact VQ-BeT's efficacy:
- Residual vs. Vanilla VQ: Two-layer RVQ produces 20–50% higher performance than single-layer VQ.
- Offset Head: Eliminating the offset head increases reconstruction error and reduces long-horizon task success by 30%.
- Autoregressive Code Prediction: Predicting code tokens in primary→secondary order boosts robustness to real-world noise.
- Code Weighting: Down-weighting the loss on secondary codes (to around $0.6$) stabilizes training and improves performance.
Limitations include the manual tuning of codebook size and quantization depth, with large codebooks sometimes requiring dead-code masking. RVQ-VAE pretraining is a necessary but separate stage; future work may address this via joint learning (Lee et al., 2024).
7. Extensions, Impact, and Future Work
VQ-BeT presents a unified, efficient paradigm for learning and generating complex, multi-modal continuous behaviors under both full and partial observability. It is positioned for:
- Scalability: Potential to operate on web-scale datasets and learn transferable or shared representations (e.g., for large fleets of robots or human action data).
- Joint Learning: Prospects for end-to-end joint training of quantizer and transformer to allow adaptive codebook evolution.
- Fine-tuning: Integration of VQ-based reinforcement learning (VQ-RL) for offline RL policy improvement atop VQ-BeT priors.
These directions aim to further enhance robustness, efficiency, and generalization in sequential decision-making tasks (Lee et al., 2024).