Analyze and remedy anomalous gradient and residual dynamics in RVQ-VAE quantization
Investigate the gradient flow dynamics in the residual vector quantized variational auto-encoder (RVQ-VAE) used by JHCodec, specifically the propagation through residual vector quantizers with input and output projections, as characterized by the derived gradient expression. The goal is to explain three anomalies: why the overall gradient norms are on the order of $10^2$; why the residual norm $\mathbf{r}_k$ does not consistently decrease as the number of quantization stages $k$ increases; and why the matrix $\mathbf{I} - \mathbf{W}^\top_{k,\mathrm{out}} \mathbf{W}^\top_{k,\mathrm{in}}$ does not converge toward the zero matrix. Then develop training objectives or architectural modifications that ensure stable and theoretically consistent behavior.
References
The overall gradient norm of the system is on the order of $10^{2}$, whereas standard Transformer decoders typically have gradient norms below 1. However, the model can still learn effective speech generation using the RVQ-VAE framework. We hypothesize that this behavior arises from the RVQ quantization formulation or suboptimal gradient flow from Eq. \ref{eq:rvq_grad}. Notably, we expect the norm of the residual $\mathbf{r}_k$ to decrease as $k$ increases. However, in practice, the residual norm does not consistently decrease. Moreover, $\mathbf{I} - \mathbf{W}^\top_{k,\mathrm{out}} \mathbf{W}^\top_{k,\mathrm{in}}$ does not converge toward the zero matrix, as no explicit objective enforces this behavior. A more detailed analysis and potential remedies for this issue are left for future work.
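The two observed quantities can be probed empirically. The following is a minimal NumPy sketch, not JHCodec's actual implementation: it assumes a simple RVQ loop in which each stage projects the residual with a hypothetical $\mathbf{W}_{k,\mathrm{in}}$, selects the nearest code, projects the code back with a hypothetical $\mathbf{W}_{k,\mathrm{out}}$, and subtracts it. All shapes, initializations, and variable names are illustrative assumptions. With untrained codebooks and projections, the per-stage residual norms need not decrease, and $\mathbf{I} - \mathbf{W}^\top_{k,\mathrm{out}} \mathbf{W}^\top_{k,\mathrm{in}}$ is far from zero, mirroring the observations above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, codebook_size, n_stages = 8, 32, 4

# Hypothetical per-stage codebooks and input/output projections (untrained).
codebooks = [rng.normal(size=(codebook_size, d)) for _ in range(n_stages)]
W_in = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_stages)]
W_out = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_stages)]

z = rng.normal(size=d)  # encoder output
r = z.copy()            # residual entering stage 1
residual_norms = []
for k in range(n_stages):
    proj = W_in[k] @ r                                         # project residual in
    idx = np.argmin(np.linalg.norm(codebooks[k] - proj, axis=1))  # nearest code
    q = W_out[k] @ codebooks[k][idx]                           # project code back out
    r = r - q                                                  # residual for stage k+1
    residual_norms.append(np.linalg.norm(r))

# Norms of the per-stage correction matrices I - W_out^T W_in^T; nothing in
# this setup drives them toward zero, since no objective penalizes them.
correction_norms = [
    np.linalg.norm(np.eye(d) - W_out[k].T @ W_in[k].T) for k in range(n_stages)
]
print(residual_norms)
print(correction_norms)
```

A training-time remedy suggested by this sketch would be an auxiliary penalty such as $\|\mathbf{I} - \mathbf{W}^\top_{k,\mathrm{out}} \mathbf{W}^\top_{k,\mathrm{in}}\|_F^2$ per stage, which would explicitly push the correction matrices toward zero; whether this helps in practice is exactly the open question the text defers to future work.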