Scaling autoregressive model capacity with increasing iFSQ codebook size

Establish whether increasing the implicit codebook size L^d of the iFSQ tokenizer—by raising the bits per dimension K in L = 2^K + 1—requires proportionally scaling the capacity of the autoregressive transformer (such as LlamaGen) to maintain or improve image generation quality, and characterize how model capacity should scale as the iFSQ codebook grows.

Background

The paper investigates iFSQ as a unified tokenizer for both diffusion and autoregressive models. In the autoregressive setting (e.g., LlamaGen), iFSQ’s implicit codebook size is determined by the number of quantization levels per dimension (L = 2^K + 1) and the latent dimensionality d, yielding an effective codebook size of L^d.
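The growth of the implicit codebook can be made concrete with a minimal sketch. The helper below is illustrative only (not from the paper's code); it assumes the relation stated above, L = 2^K + 1 levels per dimension and an effective codebook of L^d codes:

```python
def ifsq_codebook_size(bits_per_dim: int, latent_dim: int) -> int:
    """Implicit iFSQ codebook size, assuming L = 2^K + 1 levels per dimension."""
    levels = 2 ** bits_per_dim + 1  # L: quantization levels per dimension
    return levels ** latent_dim     # effective codebook size L^d

# At 4 bits per dimension there are 2**4 + 1 = 17 levels, so even a modest
# latent dimensionality d yields a combinatorially large discrete space:
for k in (2, 4, 6):
    print(k, ifsq_codebook_size(k, 8))
```

Because the codebook size is exponential in d, each extra bit per dimension multiplies the discrete prediction space the autoregressive transformer must model, which motivates the capacity-scaling question above.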

Empirically, the authors report that generation performance peaks around 4 bits per dimension and does not monotonically improve as the number of bits increases. Based on this observation, they explicitly conjecture that as the implicit codebook grows, the autoregressive model may need greater capacity to effectively predict tokens drawn from a larger discrete space.

References

The authors conjecture that as the codebook grows, the corresponding autoregressive model must also scale to provide sufficient capacity to predict over such a large codebook.

iFSQ: Improving FSQ for Image Generation with 1 Line of Code  (2601.17124 - Lin et al., 23 Jan 2026) in Section 4.1.3, Subsubsection "iFSQ for Auto-regressive Image Generation" (Comparison of different bits within iFSQ)