BitDance: Scaling Autoregressive Generative Models with Binary Tokens

This presentation explores BitDance, a breakthrough autoregressive model that achieves high-fidelity visual generation through binary token quantization, scaling vocabulary size to an unprecedented 2^256 (256 bits of token entropy). The talk examines how the binary diffusion head enables efficient sampling in this massive discrete space, and how next-patch diffusion accelerates inference through parallel prediction—all while matching or exceeding the quality of continuous tokenizers and state-of-the-art diffusion models.
Script
What if we could push autoregressive models to handle 256-bit binary tokens—creating a vocabulary so vast it rivals the expressivity of continuous representations, yet remains discrete and controllable? BitDance does exactly that, bridging the gap between discrete autoregressive generation and the fidelity of continuous diffusion models.
Let's first understand the challenge that motivated this work.
Autoregressive models have long relied on discrete tokens for stable generation, but increasing vocabulary size to improve fidelity hits a wall. Traditional vector quantization collapses under large codebooks, and directly modeling the joint distribution of high-dimensional binary tokens requires a parameter count that grows exponentially with the number of bits—making sampling intractable.
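To make the scale of that wall concrete, here is a back-of-the-envelope sketch. The hidden size of 1024 is a hypothetical choice for illustration, not a figure from the paper: a naive softmax head over all 256-bit tokens needs one logit per vocabulary entry, while factorizing into independent per-bit predictions is cheap but discards the dependencies between bits.

```python
# Why naively modeling 256-bit tokens is intractable (illustrative numbers).
HIDDEN = 1024   # hypothetical transformer hidden size
BITS = 256      # bits per token, as in BitDance

# A standard softmax head needs one logit per vocabulary entry.
naive_logits = 2 ** BITS               # 2^256 output logits: intractable
naive_params = HIDDEN * naive_logits   # the projection matrix alone

# Factorizing into 256 independent Bernoulli bits is tiny by comparison...
factored_params = HIDDEN * BITS        # 1024 * 256 = 262,144 parameters

print(f"naive head logits:    {naive_logits:.3e}")
print(f"factored head params: {factored_params}")
# ...but assuming independence throws away the joint structure between
# bits, which is exactly the dependency loss the talk refers to.
```

The factorized head is what the binary diffusion head replaces: it keeps the tractable parameter count while recovering the joint distribution.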
BitDance introduces two core innovations to overcome these barriers.
To tackle this, the authors introduce binary visual tokenization using Lookup-Free Quantization, which pushes vocabulary size to 2 to the power of 256 without collapse, achieving reconstruction quality on par with continuous models. Paired with this is the binary diffusion head, which treats tokens as hypercube vertices and learns their joint distribution through a rectified flow objective—sidestepping both parameter explosion and the loss of token dependencies.
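The core idea of Lookup-Free Quantization can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the paper's implementation: each latent dimension is quantized to one bit by its sign, so the codebook is implicit—every vertex of the {-1, +1}^D hypercube—and a D-dimensional latent yields a 2^D vocabulary with no learned embedding table to collapse. In training, a straight-through estimator would pass gradients through the hard sign.

```python
import numpy as np

def lfq_quantize(z: np.ndarray) -> np.ndarray:
    """Lookup-Free Quantization sketch: quantize each latent dim to a bit.

    The codebook is implicit (all vertices of the {-1, +1}^D hypercube),
    so there is no embedding table to collapse under a huge vocabulary.
    """
    return np.where(z >= 0.0, 1.0, -1.0)

def token_index(bits: np.ndarray) -> int:
    """Map a bit vector to its integer token id (feasible for small D only)."""
    b = (bits > 0).astype(np.uint64)
    return int(b @ (2 ** np.arange(len(b), dtype=np.uint64)))

rng = np.random.default_rng(0)
z = rng.standard_normal(8)      # 8-dim latent -> one of 2^8 = 256 tokens
bits = lfq_quantize(z)
print(bits, token_index(bits))
```

With D = 256 the same quantizer yields the 2^256 vocabulary described above; the token index simply becomes too large to enumerate, which is why sampling must operate on the bits themselves rather than on a categorical distribution.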
Complementing the binary diffusion head is next-patch diffusion, which enables parallel token prediction by modeling patches jointly rather than assuming independence. This alignment of training and inference objectives preserves image structure while delivering dramatic speedups—up to 30 times faster at high resolutions with a fraction of the parameters used by competing methods.
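The sampling loop behind next-patch diffusion can be sketched as follows. All names and shapes here are hypothetical illustrations, not the paper's API: `velocity_fn` stands in for the binary diffusion head conditioned on the autoregressive context, predicting a rectified-flow velocity field. All bits of all tokens in a patch are denoised jointly, so each integration step makes one head call in place of many sequential token predictions.

```python
import numpy as np

def sample_patch(velocity_fn, patch_tokens: int, bits: int,
                 steps: int = 8, seed: int = 0) -> np.ndarray:
    """Next-patch diffusion sketch: sample a whole patch of binary tokens
    in parallel by integrating a rectified flow, then snapping each
    coordinate to a hypercube vertex. Illustrative, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((patch_tokens, bits))  # start from Gaussian noise
    for i in range(steps):                         # Euler steps along the flow
        t = i / steps
        x = x + velocity_fn(x, t) / steps
    return np.where(x >= 0.0, 1.0, -1.0)           # snap to {-1, +1}^bits

# Toy velocity field pulling every coordinate toward +1 (a stand-in model).
toy_velocity = lambda x, t: 1.0 - x
patch = sample_patch(toy_velocity, patch_tokens=4, bits=16)
print(patch.shape)
```

Because the head learns the joint distribution over the patch rather than assuming the tokens are independent, this parallelism preserves image structure while cutting the number of sequential prediction steps—the source of the reported speedups.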
Now let's examine how these innovations translate into performance.
This chart positions BitDance against leading diffusion models and autoregressive approaches, plotting generative quality on one axis and inference efficiency on the other. Notice how BitDance achieves competitive or superior image fidelity while maintaining the computational efficiency advantages of autoregressive generation—a combination that has historically been difficult to attain, especially at high resolutions.
Turning to the numbers, BitDance achieves an FID of 1.24 on ImageNet class-conditional generation, outperforming both autoregressive and diffusion transformers at comparable model sizes. On text-to-image tasks, the 14 billion parameter model delivers top-tier scores across multiple benchmarks, and distilled versions sustain generation quality while further accelerating inference through expanded parallel prediction.
These results reveal that increasing discrete token entropy is both feasible and beneficial when paired with scalable sampling mechanisms like the binary diffusion head. However, the approach requires careful co-scaling of vocabulary and model capacity, and while promising, extensions to video and multimodal reasoning remain open research directions that could further validate this framework's generality.
BitDance demonstrates that binary tokens with extreme entropy can match continuous representations in fidelity while preserving the stability and control of autoregressive generation—a paradigm shift for scalable visual synthesis. Visit EmergentMind.com to explore the full paper and dive deeper into this exciting work.