
VQ-BeT: Transformer-Based Behavior Cloning

Updated 14 February 2026
  • The paper presents a novel transformer architecture that replaces traditional k-means discretization with a two-stage residual VQ-VAE for end-to-end behavior cloning.
  • It achieves state-of-the-art performance with 5× to 25× speedup over diffusion models, demonstrating high fidelity in complex manipulation and autonomous driving tasks.
  • The method leverages efficient single-pass inference and gradient-driven quantization to enhance scalability and real-world application in robotics.

Vector-Quantized Behavior Transformer (VQ-BeT) is a transformer-based approach to multi-modal behavior cloning that advances the ability to model continuous, high-dimensional, and multi-modal action distributions in sequential decision-making tasks. By replacing traditional k-means action discretization with a hierarchical residual vector-quantization module, VQ-BeT significantly improves both fidelity and scalability in generating complex behaviors from demonstration datasets. This method offers state-of-the-art performance across manipulation, locomotion, and real-world robotic environments, and achieves fast, single-pass inference for action generation (Lee et al., 2024).

1. Architectural Foundations

VQ-BeT employs a causal transformer backbone, specifically a lightweight “MiniGPT” stack with $L=6$ layers, $H=6$ attention heads, and token embeddings of dimension $d_\text{model}=120$. Learned positional embeddings are added to tokens. The input sequence concatenates a sliding history of $h$ past observations, (optionally) future or goal observations for conditional tasks, and the next $n$ actions or action chunks to be predicted.

Observation embeddings are modality-specific: state vectors are projected via a small MLP, while images are encoded with a pretrained ResNet-18 or, on real robots, an HPR backbone. For actions, instead of direct embedding, VQ-BeT uses a VQ-VAE encoder $\phi$ that maps $a_{t:t+n}$ to a latent $x \in \mathbb{R}^{d_\text{latent}}$, which is then quantized into discrete codes by the vector-quantization module (details in Section 2). At each time step, the transformer predicts the next discrete latent code(s) as well as a small continuous “offset” for high-fidelity action reconstruction. All inference is performed in a single transformer pass per step (Lee et al., 2024).
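As a shape-level sketch of how the input sequence is assembled, the following NumPy snippet uses random arrays as stand-ins for the trained encoders (an MLP for states, a ResNet-18 for images) and the learned positional table; only the shapes reflect the architecture described above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 120, 5   # the paper's embedding width; h is an example history length

# Hypothetical pre-computed embeddings (stand-ins for the real encoders).
obs_tokens = rng.normal(size=(h, d_model))    # h past observation embeddings
goal_tokens = rng.normal(size=(1, d_model))   # optional goal conditioning token
pos_emb = rng.normal(size=(h + 1, d_model))   # learned positional embeddings

# One causal-transformer input sequence: history, then goal, plus positions.
seq = np.concatenate([obs_tokens, goal_tokens], axis=0) + pos_emb
```

In the actual model the sequence would then pass through the MiniGPT stack in a single forward call; here the point is only the token layout.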

2. Hierarchical Vector Quantization for Actions

A novel two-stage residual VQ-VAE module tokenizes continuous action chunks into hierarchical, discrete latent codes. The encoder $\phi$ projects $a_{t:t+n}$ to $x$, which is then quantized in two stages:

  • First quantizer: $z_q^1 = e^1_{c_1}$, with $c_1 = \arg\min_j \|x - e^1_j\|_2$.
  • Residual: $r^1 = x - z_q^1$.
  • Second quantizer: $z_q^2 = e^2_{c_2}$, with $c_2 = \arg\min_j \|r^1 - e^2_j\|_2$.

The final quantized latent is $z_q(x) = z_q^1 + z_q^2$, and decoding is performed with $\psi(z_q(x))$. The VQ-VAE is trained with a composite loss: $L_\text{VQ-VAE} = L_\text{Recon} + \sum_{i=1,2} \left( \|x - \mathrm{SG}[e^i_{c_i}]\|_2^2 + \lambda_\text{commit} \|\mathrm{SG}[x] - e^i_{c_i}\|_2^2 \right)$, where $L_\text{Recon}$ is the $\ell_1$ reconstruction loss after decoding, $\mathrm{SG}[\cdot]$ is the stop-gradient operator, and $\lambda_\text{commit} = 1$. Codebook centroids are updated via exponential moving average (EMA). The non-differentiable quantization step uses the straight-through estimator for gradient flow with respect to $\phi$, treating $z_q = x$ in the backward pass (Lee et al., 2024).
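The two-stage quantization can be sketched in NumPy as follows; the codebooks and encoder output below are random placeholders rather than the paper's EMA-learned centroids, and training losses are omitted:

```python
import numpy as np

def residual_vq(x, codebook1, codebook2):
    """Two-stage residual vector quantization (sketch).

    Stage 1 snaps x to its nearest centroid; stage 2 quantizes the
    residual that stage 1 missed, giving a coarse-to-fine code pair.
    """
    c1 = int(np.argmin(np.linalg.norm(codebook1 - x, axis=1)))
    z1 = codebook1[c1]                       # z_q^1 = e^1_{c_1}
    r1 = x - z1                              # residual r^1
    c2 = int(np.argmin(np.linalg.norm(codebook2 - r1, axis=1)))
    z2 = codebook2[c2]                       # z_q^2 = e^2_{c_2}
    return c1, c2, z1 + z2                   # z_q(x) = z_q^1 + z_q^2

rng = np.random.default_rng(0)
K, d = 16, 8
cb1 = rng.normal(size=(K, d))
cb2 = 0.1 * rng.normal(size=(K, d))  # residual codebook at a finer scale
cb2[0] = 0.0                         # a zero code lets stage 2 "pass through"

x = rng.normal(size=d)               # stand-in for the encoder output phi(a)
c1, c2, z_q = residual_vq(x, cb1, cb2)
err1 = np.linalg.norm(x - cb1[c1])   # stage-1-only quantization error
err2 = np.linalg.norm(x - z_q)       # two-stage error (never worse here)
```

Because the second stage can always select the zero code, the two-stage reconstruction error is never larger than the single-stage error, which is the intuition behind the coarse-to-fine refinement.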

3. Training and Inference Objectives

After pretraining and freezing the VQ-VAE, all action chunks in the dataset are mapped to their two-stage code indices $(c_1, c_2)$. The transformer is trained to predict these indices and a continuous offset for action reconstruction. For each quantizer stage $i = 1, 2$, a classification head $\zeta^i_\text{code}$ predicts a categorical distribution over the $K$ codebook entries. A focal (cross-entropy) loss is used for each code: $L_\text{code} = L_\text{focal}(\zeta^1_\text{code}(o_{t-h:t}), c_1) + \beta \cdot L_\text{focal}(\zeta^2_\text{code}(o_{t-h:t}), c_2)$, where $\beta$ (typically $0.1$) balances primary and secondary code importance. The offset prediction is supervised by

$L_\text{offset} = \| a_{t:t+n} - (\psi(e^1_{c_1} + e^2_{c_2}) + \zeta_\text{offset}) \|_1$

The total loss for VQ-BeT is $L_\text{total} = L_\text{code} + L_\text{offset}$.
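A minimal numeric sketch of these objectives, using a standard focal-loss form ($-(1-p_t)^\gamma \log p_t$) and hand-picked toy values; all logits, actions, and offsets below are illustrative, not model outputs:

```python
import numpy as np

def focal_loss(logits, target, gamma=2.0):
    """Focal loss on a single categorical prediction (sketch)."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()                  # softmax probabilities
    pt = p[target]
    return -((1.0 - pt) ** gamma) * np.log(pt)

# Hypothetical logits from the two code heads, with ground-truth codes.
logits1 = np.array([2.0, 0.1, -1.0, 0.5])
logits2 = np.array([0.3, 1.5, 0.0, -0.2])
c1, c2 = 0, 1
beta = 0.1                           # secondary-code weight from the paper
L_code = focal_loss(logits1, c1) + beta * focal_loss(logits2, c2)

# Offset loss: l1 distance between true actions and decoded-plus-offset.
a_true = np.array([0.50, -0.20])
a_decoded = np.array([0.45, -0.10])  # stands in for psi(e^1 + e^2)
offset = np.array([0.04, -0.08])     # predicted by the offset head
L_offset = np.abs(a_true - (a_decoded + offset)).sum()

L_total = L_code + L_offset
```

With these toy numbers the residual after adding the offset is $[0.01, -0.02]$, so $L_\text{offset} = 0.03$; the offset head recovers the fine detail that the discrete codes alone cannot.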

Inference involves embedding past observations, predicting discrete code distributions and offsets in a single forward pass, sampling or selecting the most probable codes, and reconstructing actions via the VQ-VAE decoder and offset addition. The computational efficiency of single-pass inference is a central benefit (Lee et al., 2024).
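The decode step of that single-pass loop can be sketched as below, with random arrays standing in for the frozen codebooks, a linear decoder in place of $\psi$, and placeholder values for the transformer's outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d_latent, action_dim = 16, 8, 2

# Hypothetical frozen VQ-VAE pieces: two codebooks and a linear "decoder".
cb1 = rng.normal(size=(K, d_latent))
cb2 = 0.1 * rng.normal(size=(K, d_latent))
W_dec = rng.normal(size=(d_latent, action_dim))   # stands in for psi

# Stand-ins for one transformer forward pass over the observation window.
code_logits1 = rng.normal(size=K)
code_logits2 = rng.normal(size=K)
offset = 0.01 * rng.normal(size=action_dim)

# Single-pass decoding: most probable codes -> summed latent -> decode -> offset.
c1 = int(np.argmax(code_logits1))
c2 = int(np.argmax(code_logits2))
z_q = cb1[c1] + cb2[c2]
action = z_q @ W_dec + offset
```

Everything after the forward pass is cheap table lookups and one decoder call, which is why per-step latency stays in the tens of milliseconds.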

4. Comparison with Prior and Baseline Methods

VQ-BeT extends the Behavior Transformer (BeT) framework (Shafiullah et al., 2022), which uses k-means quantization to discretize the action space and a multi-head transformer to predict bins and continuous offsets. Although BeT effectively captures multimodality through action clustering, it suffers from poor scaling in high-dimensional action spaces, a lack of gradient-based optimization of the quantization, and manual selection of $k$ (the number of bins). VQ-BeT addresses these issues by:

  • Employing learned, hierarchical codebooks within VQ-VAE for end-to-end quantization.
  • Gaining gradient flow via the straight-through estimator, adapting codebooks to optimize behavior reconstruction.
  • Achieving finer quantization with far fewer centroids using coarse-to-fine residual quantization.

Compared to diffusion-policy models, which perform iterative denoising for multimodal action sampling (requiring $10$–$20$ network passes per action and thus slow inference, around $100$ ms), VQ-BeT requires only a single pass (approximately $15$–$25$ ms), yielding a $5\times$ to $25\times$ speedup (Lee et al., 2024).

5. Empirical Results and Benchmarks

VQ-BeT has demonstrated state-of-the-art or comparable performance across a diverse set of environments:

  • Simulated Manipulation: PushT (with images), Kitchen, BlockPush, UR3 BlockPush.
  • Locomotion: Multimodal Ant.
  • Autonomous Driving: nuScenes dataset.
  • Real-World Robotics: Single-phase, two-phase, and extended-horizon tasks on the Stretch manipulator platform.

Key results indicate:

  • State-of-the-art success on $5/7$ unconditional and $6/7$ conditional tasks.
  • $+10$–$25\%$ relative success increase over BeT; surpasses DiffusionPolicy in long-horizon problems.
  • Improved behavior entropy on $5/7$ tasks vs. all baselines, reflecting greater diversity in generated behaviors.
  • Real-robot, long-horizon task success: VQ-BeT achieves $60$–$80\%$ per subtask vs. $10$–$30\%$ for DiffusionPolicy.
  • Inference speed per action: $15$–$25$ ms for VQ-BeT vs. $\sim 100$ ms for diffusion (Lee et al., 2024).

6. Implementation and Practical Considerations

For deploying VQ-BeT, the following procedural recommendations are made:

  1. Collect a demonstration dataset $D = \{(o_t, a_t)\}$.
  2. Pretrain the VQ-VAE ($N_q = 2$ stages, $K = 8$–$16$ per codebook, $\lambda_\text{commit} = 1$, latent dimension $\sim 512$), minimizing $L_\text{VQ-VAE}$.
  3. Discretize the dataset: map every action chunk to $(c_1, c_2)$.
  4. Build transformer input sequences of observation tokens, (optional) goal tokens, and code tokens.
  5. Train the transformer (learning rate $5 \times 10^{-5}$, batch size $128$, $300$–$2000$ epochs), optimizing $L_\text{total}$.
  6. At inference: single-pass to predict code indices and offset, decode and apply action, slide window forward (Lee et al., 2024).

Critical hyperparameters include $N_q = 2$, $K = 8$–$16$, $\beta \approx 0.1$, and sequence lengths $h = 5$–$100$, $n = 1$–$10$. Increasing $K$ or $N_q$ helps scale to higher action dimensions. Always including the $\ell_1$ offset head is crucial, as its omission yields a $10$–$15\%$ drop in fidelity.
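For concreteness, the hyperparameters above can be collected into a single configuration; the dictionary keys below are hypothetical names for this sketch, not the paper's or any library's actual config schema:

```python
# Hypothetical config collecting the hyperparameters listed above.
vq_bet_config = {
    "n_quantizer_stages": 2,       # N_q
    "codebook_size": 16,           # K, 8-16 entries per codebook
    "commit_weight": 1.0,          # lambda_commit
    "secondary_code_weight": 0.1,  # beta
    "history_length": 10,          # h, 5-100 depending on the task
    "action_chunk": 5,             # n, 1-10
    "lr": 5e-5,
    "batch_size": 128,
    "use_offset_head": True,       # omission costs 10-15% fidelity
}
```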

7. Significance, Extensions, and Limitations

VQ-BeT advances the modeling of complex sequential behaviors by integrating learned, hierarchical action discretization with scalable, efficient transformer architectures. Its capacity for handling multi-modal, high-dimensional action prediction extends applicability to real-world robotics and autonomous driving, with demonstrated improvements over both k-means-discretized BeT (Shafiullah et al., 2022) and iterative denoising methods. The design enables gradient-based refinement of action codebooks and fast inference suitable for control in latency-sensitive environments.

This suggests that vector-quantized paradigms may supplant static clustering approaches in high-dimensional behavior cloning. For action spaces of exceptional dimensionality or complexity, increasing the quantization depth ($N_q$) or codebook sizes ($K$) may further expand representational capacity. Including the offset regression head is essential for full-fidelity control trajectories.

Limitations include the requirement for careful codebook size and structure selection to prevent dead codes, and challenges in maintaining precision with aggressive action chunking in noisy hardware. There are no explicit reward-compositionality mechanisms; all learning is derived from raw demonstration data without reward supervision (Lee et al., 2024).

References

  • Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., & Pinto, L. (2024). Behavior Generation with Latent Actions.
  • Shafiullah, N. M. M., Cui, Z. J., Altanzaya, A., & Pinto, L. (2022). Behavior Transformers: Cloning k modes with one stone.
