A Survey on Diffusion Language Models

Published 14 Aug 2025 in cs.CL, cs.AI, and cs.LG | (2508.10875v1)

Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

Summary

The paper introduces diffusion language models as an alternative to autoregressive methods by leveraging iterative denoising for parallel token generation.
The paper details training and inference strategies, including complementary masking, caching, and step distillation, and reports competitive benchmark results in language, code, and multimodal tasks.
The paper outlines key open challenges such as scalability, infrastructure development, and long-sequence handling that must be addressed for broader adoption.

A Comprehensive Survey of Diffusion Language Models

Introduction and Motivation

Diffusion Language Models (DLMs) have emerged as a compelling alternative to the autoregressive (AR) paradigm for language generation, leveraging iterative denoising processes to enable parallel token generation and bidirectional context modeling. This survey systematically reviews the evolution, taxonomy, training and inference strategies, multimodal extensions, empirical performance, and open challenges of DLMs, providing a technical synthesis for researchers and practitioners.

Figure 1: Timeline of Diffusion Language Models, highlighting the shift from continuous to discrete and multimodal DLMs.

Evolution and Taxonomy of DLMs

The development of DLMs can be categorized into three main groups: continuous DLMs, discrete DLMs, and multimodal DLMs. Early research focused on continuous-space models, where diffusion operates in the embedding or logit space. Discrete DLMs, which define the diffusion process directly over token vocabularies, have gained traction due to their scalability and compatibility with large-scale language modeling. Recent advances have extended DLMs to multimodal domains, enabling unified modeling of text, images, and other modalities.
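To make "diffusion in embedding space" concrete, the block below writes out a Gaussian forward process on token embeddings. It assumes the standard DDPM-style formulation rather than any detail stated in this summary; the symbols $\mathrm{Emb}$ (embedding lookup), $\bar\alpha_t$ (cumulative noise schedule), and $\epsilon$ (Gaussian noise) are introduced here for illustration only.

```latex
x_0 = \mathrm{Emb}(w), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right), \qquad
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
```

As $t$ grows, $\bar\alpha_t$ shrinks toward 0 and $x_t$ approaches pure noise; the model is trained to reverse this corruption step by step.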

Figure 2: Research trend showing the increasing number of DLM papers, especially in discrete and multimodal settings.

Modeling Paradigms and Architectural Distinctions

DLMs are positioned within the broader landscape of language modeling paradigms, which include masked language models (MLMs), AR models, permutation language models, and sequence-to-sequence architectures. DLMs distinguish themselves by their iterative, non-sequential generation process, which allows for parallelism and bidirectional context utilization. Continuous DLMs operate in embedding or logit spaces, while discrete DLMs employ token-level corruption and denoising, often using masking strategies.
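The token-level corruption used by masked discrete DLMs can be sketched in a few lines. This is a minimal illustration, assuming each token is independently replaced by a mask symbol with probability t (the noise level); the `MASK` string and the `mask_tokens` helper are hypothetical names, not part of any particular model's code.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption for illustration)

def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
sentence = "diffusion models denoise text in parallel".split()
for t in (0.0, 0.5, 1.0):
    # t = 0.0 leaves the sentence intact; t = 1.0 masks every token.
    print(t, mask_tokens(sentence, t, rng))
```

Training then asks the model to predict the original tokens at the masked positions, using the unmasked tokens on both sides as context.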

Figure 4: Overview of training and inference procedures across AR, continuous DLM, discrete DLM, and block-wise hybrid models.

Training and Post-Training Strategies

Pre-training

DLMs are typically pre-trained using objectives analogous to those in AR or image diffusion models. Discrete DLMs often initialize from AR model weights (e.g., LLaMA, Qwen2.5), facilitating efficient adaptation and reducing training cost. Continuous DLMs may leverage pretrained image diffusion backbones for multimodal tasks.

Supervised Fine-Tuning and RL Alignment

Supervised fine-tuning (SFT) in DLMs mirrors AR approaches but must address the inefficiency of loss computation due to partial masking. Techniques such as complementary masking and improved scheduling have been proposed to enhance gradient flow and data utilization.
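One way to read complementary masking: because the loss is computed only on masked positions, a single masked copy of a sample wastes the unmasked tokens. Pairing each mask with its complement lets every token contribute to the loss exactly once across the two copies. The sketch below illustrates that idea under this assumption; `complementary_views` and `MASK` are hypothetical names.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption for illustration)

def complementary_views(tokens, t, rng):
    """Build two masked copies of `tokens` whose masks are complements."""
    mask_a = [rng.random() < t for _ in tokens]
    mask_b = [not m for m in mask_a]  # masked exactly where view A is not
    view_a = [MASK if m else tok for tok, m in zip(tokens, mask_a)]
    view_b = [MASK if m else tok for tok, m in zip(tokens, mask_b)]
    return view_a, view_b

tokens = "the answer is forty two".split()
view_a, view_b = complementary_views(tokens, 0.5, random.Random(0))
print(view_a)
print(view_b)
```

Each position is a prediction target in exactly one of the two views, so no token in the sample is left without a training signal.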

Post-training for reasoning capabilities is a critical area, with methods such as Diffusion-of-Thought (DoT), DCoLT, and various policy gradient adaptations (e.g., diffu-GRPO, UniGRPO, coupled-GRPO) enabling DLMs to perform complex reasoning and alignment tasks. Preference optimization methods (e.g., VRPO) have also been adapted to the diffusion setting, addressing the high variance of ELBO-based log-likelihood approximations.
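The GRPO family scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-relative advantage normalization, as in standard GRPO; it does not show how the diffusion variants estimate log-likelihoods, and the use of the population standard deviation and the `eps` term are assumptions made here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, with reward 1 for correct and 0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions above the group mean receive positive advantages and are reinforced; those below are pushed down, without needing a separate value network.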

Inference Techniques and Efficiency Optimizations

Inference in DLMs is characterized by a rich set of strategies aimed at balancing quality, controllability, and efficiency:

Parallel Decoding: Confidence-aware and adaptive parallel decoding methods enable substantial speed-ups (up to 34×) with minimal quality loss.
Unmasking/Remasking: Adaptive policies for token selection and remasking improve both convergence and output coherence.
Guidance: Classifier-free guidance and structural constraints steer generation toward desired attributes, with extensions for semantic and syntactic control.
Caching and Step Distillation: Innovations in KV and feature caching, as well as step distillation, have closed much of the inference latency gap with AR models, achieving up to 500× acceleration in some cases.
Figure 3: Inference techniques for DLMs, including parallel decoding, unmasking/remasking, guidance, caching, and step distillation.
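The interplay of parallel decoding and unmasking can be sketched as a loop: predict every masked position, commit all predictions whose confidence clears a threshold, and leave the rest masked for the next round. This is a toy illustration, not the algorithm of any specific paper; `toy_predict` stands in for a real DLM forward pass (it returns the correct token with a random confidence), and all names are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption for illustration)

def toy_predict(seq, target, rng):
    """Stand-in for a model call: a (token, confidence) guess per masked slot."""
    return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok == MASK}

def confidence_decode(target, threshold=0.7, seed=0):
    """Unmask all positions above `threshold` each step; return (tokens, steps)."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        preds = toy_predict(seq, target, rng)
        accepted = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not accepted:
            # Always commit the most confident position so the loop progresses.
            accepted = [max(preds, key=lambda i: preds[i][1])]
        for i in accepted:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps

tokens, steps = confidence_decode("the cat sat on the mat".split())
print(tokens, "decoded in", steps, "steps")
```

Lowering the threshold commits more tokens per step (fewer steps, more risk of incoherent combinations); raising it approaches one token per step.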

Multimodal and Unified DLMs

Recent work has extended DLMs to multimodal and unified architectures, supporting both understanding and generation across text and vision. Approaches include:

Vision Encoders + DLMs: Models like LLaDA-V and LaViDa integrate vision encoders with DLM backbones, employing complementary masking and KV-caching for efficient training and inference.
Unified Token Spaces: MMaDA and UniDisc tokenize all modalities into a shared space, enabling joint modeling and cross-modal reasoning.
Hybrid Training: Dimple employs an autoregressive-then-diffusion training regime to stabilize multimodal learning and enable parallel decoding.

These models demonstrate performance competitive with, and in some cases superior to, AR-based multimodal models, particularly in cross-modal reasoning and generation.

Empirical Performance and Benchmarking

DLMs have achieved performance on par with, and in some cases exceeding, AR models of similar scale across a range of benchmarks, including language understanding (PIQA, HellaSwag), code generation (HumanEval), mathematical reasoning (GSM8K), and multimodal tasks (MME, MMMU). Notably, DLMs exhibit stronger performance in math and science-related benchmarks and demonstrate superior throughput in code generation and multimodal settings.

Figure 5: Performance comparison on eight benchmarks, showing DLMs (orange) competitive with AR models (blue) across tasks and scales.

Trade-offs, Limitations, and Open Challenges

Despite their promise, DLMs face several unresolved challenges:

Parallelism–Performance Trade-off: Increased parallelism can degrade output coherence due to inter-token dependency issues, especially with aggressive unmasking schedules.
Infrastructure and Ecosystem: Lack of mature, open-source libraries and deployment frameworks hinders practical adoption.
Long-Sequence and Dynamic-Length Generation: DLMs are limited in context length and dynamic output sizing, with cubic inference complexity in sequence length.
Scalability: Public DLMs remain significantly smaller than state-of-the-art AR models, and scaling laws for DLMs are not yet fully established.

Figure 6: Generation results illustrating the trade-off between parallelism and output quality in DLMs.
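One way to arrive at the cubic figure for long sequences, assuming no caching, full bidirectional attention recomputed at every denoising step, and a number of steps $T$ that grows in proportion to the sequence length $N$ (these assumptions are spelled out here for illustration):

```latex
\text{cost per step} = O(N^2), \qquad T = O(N)
\;\;\Longrightarrow\;\;
\text{total cost} = T \cdot O(N^2) = O(N^3)
% Example: doubling N from 1024 to 2048 multiplies the total cost by 2^3 = 8.
```

Caching and few-step sampling attack the two factors separately: the former reduces the per-step cost, the latter reduces $T$.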

Applications and Implications

DLMs have been successfully applied to a broad spectrum of tasks, including robust text classification, NER, summarization, style transfer, code generation, and computational biology. Their global planning and iterative refinement capabilities are particularly advantageous for structured and logic-heavy domains. In code generation, DLMs have demonstrated competitive HumanEval performance and superior throughput compared to AR baselines.

Theoretically, DLMs offer a unified framework for generative modeling across modalities, with inherent advantages in controllability and bidirectional context. Practically, their parallelism and efficiency optimizations position them as viable candidates for latency-sensitive and large-scale applications, contingent on further advances in infrastructure and scalability.

Future Directions

Key areas for future research include:

Improving training efficiency and data utilization.
Adapting quantization, pruning, and distillation techniques to the diffusion paradigm.
Advancing unified multimodal reasoning and DLM-based agent architectures.
Developing robust infrastructure and deployment frameworks for DLMs.

Conclusion

This survey provides a comprehensive technical synthesis of the DLM landscape, highlighting the paradigm's modeling innovations, empirical strengths, and open challenges. DLMs have established themselves as a credible alternative to AR models, particularly in settings demanding parallelism, bidirectional context, and unified multimodal reasoning. Continued research into scalability, efficiency, and infrastructure will be critical for realizing the full potential of diffusion-based language modeling.

Explain it Like I'm 14

What is this paper about?

This paper is a survey, which means it reviews and explains a whole field rather than presenting just one new model. It focuses on Diffusion Language Models (DLMs), a newer way to make AI write text. Unlike the usual “one word at a time” approach, DLMs create many words in parallel by starting with a messy version of a sentence and cleaning it up step by step. The authors explain how DLMs work, how they’re trained, how they’re used, what makes them fast and controllable, and where they still struggle.

What questions does it try to answer?

To make the topic easy to follow, the paper looks at several simple questions:

How do DLMs compare to older styles like autoregressive (AR) models (which generate one token at a time) and masked language models (which guess missing words)?
What kinds of DLMs exist, and how are they built? (For example, some work with numbers behind the scenes, others work directly with words.)
How are DLMs trained and sped up to be practical?
How well do DLMs perform now, and what can they be used for (text, code, even combining text with images)?
What are the current challenges and what should researchers work on next?

How did the authors study it?

This is a structured review of the field. Here’s the approach in everyday language:

They map the big picture: The authors organize DLMs into a “taxonomy” (a family tree of methods), charting how the ideas evolved over time, and how DLMs relate to AR models (like GPT) and masked models (like BERT).
They explain two main flavors of DLMs:
- Continuous DLMs: These turn words into vectors (lists of numbers called embeddings) and “add noise” (like static on a radio) to them. The model learns to remove the noise in steps, ending with clean text.
- Discrete DLMs: These work directly with tokens (words/pieces of words). They often use a special [MASK] token and learn to fill in the blanks, unmasking the most confident words over several rounds.
They describe training and tuning: How models are trained from scratch, or adapted from existing models (for example, taking a GPT-like model and teaching it to do diffusion), then fine-tuned with supervised examples or improved with reinforcement learning to follow instructions and reason better.
They cover speed-up tricks for generation (inference):
- Parallel decoding (accept several tokens at once).
- Unmasking/remasking (reveal confident words, keep uncertain parts hidden, then try again).
- Caching (save earlier computations so you don’t redo them).
- Guidance (steer the style or correctness of the output).
- Few-step samplers and distillation (generate high quality text with fewer steps).
They survey multimodal extensions: Models that use both text and images using the same diffusion idea.
They summarize applications and limitations across many papers and benchmarks.

Think of it like reviewing many puzzle-solving strategies: AR models fill in the crossword one square at a time from left to right; DLMs fill lots of squares at once, check, then refine the tricky parts over a few rounds.

What did they find, and why does it matter?

Here are the main takeaways, explained simply:

DLMs can be fast: Because they can generate multiple tokens in parallel, DLMs can be several times faster than the classic “one token at a time” approach. Some industry reports even show thousands of tokens per second.
They use both left and right context: DLMs naturally look at words before and after a blank. This “bidirectional” view can lead to more coherent and controllable text, especially for tasks like filling gaps or following structure.
Iterative refinement helps quality: DLMs don’t have to be perfect in one shot. They guess, lock in the confident parts, and keep improving the uncertain parts. This often produces clearer, more consistent text.
They are getting competitive with big AR models: Recent large DLMs (like LLaDA-8B, Dream-7B, DiffuLLaMA) are now close to or on par with strong autoregressive models of similar size for many tasks, while offering speed and control advantages.
They work well beyond plain text: The same diffusion idea that powers image models (like Stable Diffusion) also helps DLMs connect text and images. New “multimodal” DLMs can understand and generate across both.
Training recipes are maturing: You can train DLMs from scratch, adapt from existing AR models, or even borrow ideas from image diffusion models. Post-training with supervised data and reinforcement learning improves instruction-following and reasoning.
Practical optimizations matter a lot: Techniques like parallel decoding, smart unmasking/remasking, and caching make a big difference in speed and quality, making DLMs more usable in real systems.

Why it matters: If you want chatbots, code assistants, or multimodal tools that are faster, more controllable, and able to refine their answers, DLMs are a promising path forward.

What challenges remain?

While DLMs are exciting, the survey also highlights hurdles:

Efficiency and infrastructure: Even with parallelism, DLMs can need careful engineering (special caches, clever schedules) and strong hardware to run at their best.
Long documents and memory: Handling very long inputs is still hard, though early results (like LongLLaDA) look promising with smart position tricks.
Reasoning and planning: Getting DLMs to plan multi-step answers (like math proofs or long explanations) is improving but not solved; new training and decoding methods help.
Generation steps: Diffusion is iterative by nature. Reducing the number of steps without hurting quality is a key goal.
Dynamic lengths and discrete choices: Working directly with tokens makes some parts (like deciding length or revising past tokens) tricky—new designs are addressing this.

Why does this research matter in the big picture?

Faster, more responsive AI: DLMs can cut the wait time for text generation, which is crucial for real-time apps (chat, coding, search).
Better control and editing: Because DLMs refine text in rounds and can “look both ways,” they’re naturally good at filling gaps, editing, and following structure or style.
One framework for text and images: A shared diffusion idea across text and vision could lead to simpler, more unified AI systems that understand and generate in several modalities.
New frontiers: The survey points to many applications—writing, coding, information extraction, dialog, and even biology (like designing proteins)—where DLMs are already being tested.

In short, this paper maps the fast-growing world of Diffusion Language Models. It shows that DLMs are moving from a cool idea to practical systems: they’re faster, more flexible, increasingly powerful, and ready to be used alongside (and sometimes instead of) traditional models. The field is advancing quickly, and this survey acts like a guidebook for anyone who wants to understand where DLMs are now and where they’re headed next.

GitHub

GitHub - VILA-Lab/Awesome-DLMs: The official GitHub repo for the survey paper "A Survey on Diffusion Language Models". (5 stars)
