Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Published 31 May 2025 in cs.CL, cs.AI, cs.LG, and cs.PF | (2506.00413v1)

Abstract: The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion LLMs (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.

Abstract PDF Upgrade to Chat

Summary

The paper presents Adaptive Parallel Decoding (APD) that dynamically adjusts parallel token generation to improve speed over traditional autoregressive methods.
APD integrates a smaller autoregressive model to assess joint token probabilities, effectively balancing marginal and joint distributions.
Experiments demonstrate that APD boosts throughput while maintaining high-quality results, making diffusion LLMs viable for time-sensitive applications.

Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Introduction

The paper "Accelerating Diffusion LLMs via Adaptive Parallel Decoding" introduces a novel technique, Adaptive Parallel Decoding (APD), aimed at enhancing the generation speed of diffusion LLMs (dLLMs). The current bottleneck in text generation lies in autoregressive decoding, which predicts tokens sequentially. Diffusion models, while theoretically capable of parallel generation, face quality degradation when exploiting full parallelism without additional modifications. APD addresses these challenges by dynamically adjusting the number of tokens sampled in parallel.

Background

Diffusion LLMs (dLLMs) present an intriguing alternative to autoregressive models, offering potential for parallelized token generation. dLLMs are inspired by the mechanisms of image diffusion models and seek to reverse the noise process added to data sequences. A core limitation is that dLLMs generate tokens based on marginal probabilities, often neglecting inter-token dependencies that autoregressive models inherently capture. Thereby, maintaining quality in parallelized generation poses a significant challenge.

Figure 1: Autoregressive vs Adaptive Parallel Decoding (APD).

Methodology

Adaptive Parallel Decoding (APD): APD redefines the decoding paradigm for dLLMs by integrating a smaller autoregressive model to evaluate joint token probabilities. The critical insight is that rather than fixing a predetermined number of tokens to generate in parallel, APD adaptively decides the parallelizable token count based on a "multiplicative mixture" of marginal and joint token probabilities. This approach allows APD to maintain fidelity to the target text distribution while efficiently sampling in parallel.

The procedure hinges on a universal coupling mechanism using shared randomness to sample from a target distribution, achieving a balance between generation speed and quality. Algorithmically, APD requires three tunable parameters: the multiplicative mixture weight, the recompute KV window, and the maximum masked lookahead. These parameters provide flexibility in managing the trade-off between throughput and quality.

Optimization Enhancements: The methodology also incorporates architectural optimizations such as KV caching and limiting the masked input size, which are traditionally used in autoregressive models to improve computational efficiency. By extending these techniques, APD significantly enhances the practical throughput of dLLMs.

Experimental Results

The experiments conducted validate that APD can deliver high throughput with minimal quality loss across tasks such as GSM8K. The findings highlight that for small values of the mixture weight parameter, the algorithm achieves impressive token parallelism rates while maintaining high generation quality (Figure 2).

Figure 2: Naive Parallel Generation. The tradeoff between parallelization and generation quality in diffusion models.

The Pareto frontier analysis demonstrates that APD offers a unique position by providing optimal configurations in terms of speed and quality. APD-enabled dLLMs not only outpace baseline dLLMs but also match and exceed the efficacies of autoregressive LLMs under certain configurations.

Conclusion

The introduction of Adaptive Parallel Decoding represents a tangible advancement in the field of LLM generation, offering a mechanism to reconcile the inherent parallel capabilities of diffusion models with the strong dependency modeling of autoregressive approaches. By doing so, APD enables dLLMs to be considered viable contenders in time-sensitive applications. Future research could explore further integration of APD with other architectural innovations and broader applications beyond standard benchmarks, potentially redefining efficient LLM deployment.

Markdown Report Issue