Analyze and remedy anomalous gradient and residual dynamics in RVQ-VAE quantization

Investigate the gradient flow dynamics in the residual vector quantized variational auto-encoder (RVQ-VAE) used by JHCodec, specifically the propagation through residual vector quantizers with input and output projections as characterized by the derived gradient expression. The goal is to explain why the overall gradient norms are on the order of 100, why the norm of the residual r_k does not consistently decrease as the number of quantization stages k increases, and why the matrix I − W_out^T W_in^T does not converge toward the zero matrix, and then to develop training objectives or architectural modifications that ensure stable and theoretically consistent behavior.
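The paper's exact quantizer implementation and Eq. \ref{eq:rvq_grad} are not reproduced in this document, so the following is a minimal sketch of the structure under analysis, assuming a standard projected-RVQ stage with nearest-neighbour lookup and a straight-through estimator; the class name ProjectedRVQStage, the dimensions, and the initialization are illustrative, not JHCodec's actual code.

```python
import torch
import torch.nn as nn


class ProjectedRVQStage(nn.Module):
    """One residual VQ stage with input and output projections (illustrative sketch)."""

    def __init__(self, latent_dim: int, code_dim: int, codebook_size: int):
        super().__init__()
        # Row-vector convention: features live in the last dimension, so the
        # projections act as h = r @ W_in and reconstruction = q @ W_out.
        self.w_in = nn.Parameter(torch.randn(latent_dim, code_dim) / latent_dim ** 0.5)
        self.w_out = nn.Parameter(torch.randn(code_dim, latent_dim) / code_dim ** 0.5)
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, residual: torch.Tensor):
        # residual: (batch, latent_dim). Project into the codebook space.
        h = residual @ self.w_in
        # Nearest-neighbour code lookup (non-differentiable).
        codes = torch.cdist(h, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(codes)
        # Straight-through estimator: forward uses the code vector q,
        # backward treats the quantizer as the identity on h.
        q_st = h + (q - h).detach()
        # Project back to the latent space and form the next residual.
        quantized = q_st @ self.w_out
        return residual - quantized, quantized
```

With the straight-through path active, the map from the incoming residual to the returned next residual has Jacobian I − W_in W_out, so backpropagation multiplies the upstream gradient by (I − W_in W_out)^T = I − W_out^T W_in^T at every stage; this is the matrix whose behaviour the task asks to explain.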

Background

The paper derives an explicit expression for gradient propagation through residual vector quantization with input and output projections and observes unexpected training behavior: unusually large overall gradient norms, non-monotonic residual norms across RVQ stages, and lack of convergence in a key projection-related matrix. These phenomena suggest potential issues in the RVQ quantization formulation or suboptimal gradient flow.

The authors report that, contrary to expectation, the residual norm does not consistently decrease as additional quantization stages are applied and that the matrix I − W_out^T W_in^T does not approach zero, indicating possible inefficiencies or instability in how residuals are refined and how projections interact during training. They explicitly defer a deeper theoretical and empirical analysis, as well as the development of remedies, to future work.
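Because no explicit objective pushes I − W_out^T W_in^T toward zero, one illustrative direction for the requested remedies (an assumption of this write-up, not something the authors propose) is an auxiliary Frobenius-norm penalty on that matrix, combined with per-stage residual-norm logging to test whether ||r_k|| actually decreases. A minimal sketch, assuming the stage interface from the snippet above:

```python
import torch


def projection_consistency_loss(stages) -> torch.Tensor:
    """Penalize ||I - W_out^T W_in^T||_F^2 summed over projected RVQ stages."""
    total = 0.0
    for stage in stages:
        latent_dim = stage.w_in.shape[0]
        eye = torch.eye(latent_dim, device=stage.w_in.device)
        gap = eye - stage.w_out.t() @ stage.w_in.t()
        total = total + gap.pow(2).sum()
    return total


@torch.no_grad()
def residual_norms(stages, latent: torch.Tensor) -> list:
    """Log the mean ||r_k|| before and after each stage to check monotonicity."""
    norms, residual = [], latent
    for stage in stages:
        norms.append(residual.norm(dim=-1).mean().item())
        residual, _ = stage(residual)
    norms.append(residual.norm(dim=-1).mean().item())
    return norms
```

Such a penalty would typically be added to the codec objective with a small weight; whether it also tames the roughly 10^2 gradient norms is exactly the empirical question deferred to future work.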

References

The overall gradient norm of the system is on the order of $10^{2}$, whereas standard Transformer decoders typically have gradient norms below 1. However, the model can still learn effective speech generation using the RVQ-VAE framework. We hypothesize that this behavior arises from the RVQ quantization formulation or suboptimal gradient flow from Eq. \ref{eq:rvq_grad}. Notably, we expect the norm of the residual $\mathbf{r}_k$ to decrease as $k$ increases. However, in practice, the residual norm does not consistently decrease. Moreover, $\mathbf{I} - \mathbf{W}^{\top}_{k,\mathrm{out}} \mathbf{W}^{\top}_{k,\mathrm{in}}$ does not converge toward the zero matrix, as no explicit objective enforces this behavior. A more detailed analysis and potential remedies for this issue are left for future work.
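Eq. \ref{eq:rvq_grad} itself is not included in this excerpt. Under the straight-through formulation assumed in the sketches above (a plausible reading, not necessarily the paper's exact derivation), the per-stage update and the resulting backward factor would be:

```latex
% Assumed projected-RVQ stage (row-vector convention); Q_k is the codebook lookup,
% treated as the identity map in the backward pass (straight-through estimator).
\mathbf{r}_{k+1} = \mathbf{r}_k - Q_k\!\left(\mathbf{r}_k \mathbf{W}_{k,\mathrm{in}}\right)\mathbf{W}_{k,\mathrm{out}},
\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{r}_k}
\approx \frac{\partial \mathcal{L}}{\partial \mathbf{r}_{k+1}}
\left(\mathbf{I} - \mathbf{W}^{\top}_{k,\mathrm{out}} \mathbf{W}^{\top}_{k,\mathrm{in}}\right)
```

Iterating this recursion over all quantization stages multiplies the upstream gradient by a product of such per-stage factors, which is one hypothesis for how overall gradient norms on the order of $10^{2}$ could arise when those factors are far from contractions.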