- The paper introduces a novel hybrid approach combining masked discrete diffusion with continuous latents to capture joint dependencies and improve generation quality.
- It presents two variants (FUJI-LDDMs and SEQ-LDDMs) with tailored denoising steps, achieving lower perplexity on benchmarks like LM1B.
- The framework employs ELBO-based training and careful noise scheduling to address the factorization problem of traditional discrete diffusion models.
Latent Discrete Diffusion Models
Introduction
The paper introduces Latent Discrete Diffusion Models (LDDMs) to overcome the limitations of masked denoisers in discrete diffusion, particularly for categorical data such as language. Traditional discrete diffusion models, while effective for generation tasks, often rely on factorized reverse transitions that disregard joint dependencies across tokens, degrading performance in few-step generation.
LDDMs propose a hybrid approach, coupling a masked discrete diffusion process with a continuous diffusion over latent embeddings to capture cross-token dependencies. This dual-channel approach aims to provide a softer, more informative signal for learning and inference, thereby improving coherence and quality in generated outputs.
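As a toy illustration of the factorization problem, consider a joint distribution over two tokens that must agree. The per-position marginals are uniform, so a denoiser that unmasks both positions independently in a single step loses the correlation. The setup below is a constructed example, not taken from the paper:

```python
import numpy as np

# Hypothetical toy joint: P("a a") = P("b b") = 0.5, so tokens must match.
# The per-position marginals are uniform over {"a", "b"}. A factorized
# one-step denoiser samples each position independently from its marginal,
# producing mismatched pairs about half the time.
rng = np.random.default_rng(0)
vocab = np.array(["a", "b"])
marginal = np.array([0.5, 0.5])  # per-position marginal of the joint

n = 10_000
tok0 = rng.choice(vocab, size=n, p=marginal)
tok1 = rng.choice(vocab, size=n, p=marginal)  # sampled independently of tok0
match_rate = float(np.mean(tok0 == tok1))
print(f"factorized one-step match rate: {match_rate:.2f}")
```

Under the true joint the match rate would be 1.0; independent per-position sampling drives it toward 0.5, which is exactly the degradation the latent channel is meant to repair.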
Latent Discrete Diffusion Models (LDDMs) Framework
The proposed LDDMs enhance masked discrete diffusion by incorporating a continuous latent space. The framework operates with three main components: a latent encoder, a forward noising process, and a denoising process, implemented in two core variants:
- FUJI-LDDMs: FUlly JoInt denoising models that simultaneously evolve both token and latent channels with shared interactions at each denoising step.
- SEQ-LDDMs: SEQuential denoising models that first resolve the continuous latent channel and subsequently condition the discrete chain on this latent representation, allowing each channel to be optimized for specific roles in the denoising process.
Both variants are trained with Evidence Lower Bound (ELBO)-style objectives, accommodating the distinct learning dynamics that arise from combining discrete and continuous processes.
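The combined objective can be sketched as a weighted sum of a masked cross-entropy term for the token channel and a regression term for the latent channel. The function signature and the weights `w_disc`/`w_lat` below are illustrative placeholders, not the paper's exact parameterization:

```python
import numpy as np

def lddm_loss(token_logits, targets, mask, latent_pred, latent_target,
              w_disc=1.0, w_lat=1.0):
    """Hedged sketch of an ELBO-style combined objective for LDDMs."""
    # Discrete channel: cross-entropy over masked positions only,
    # as in masked discrete diffusion ELBOs.
    logits = token_logits[mask]                                  # (M, V)
    m = logits.max(axis=-1, keepdims=True)                       # for stability
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    ce = -logp[np.arange(logits.shape[0]), targets[mask]].mean()
    # Continuous channel: regression to the clean latent (x0-prediction),
    # the usual simplification of the Gaussian ELBO term.
    mse = np.mean((latent_pred - latent_target) ** 2)
    return w_disc * ce + w_lat * mse
```

Time-dependent ELBO weights would replace the constants `w_disc` and `w_lat`; with constant weights this reduces to the "constant loss weighting" option mentioned below.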
Implementation Details
LDDMs require careful design of the following components:
- Latent Encoder: Can be pre-trained or custom-trained to extract informative latent representations from tokens. The encoder facilitates the learning of global dependencies across tokens, addressing the factorization issue in traditional masked discrete diffusions.
- Noise Scheduling and Loss Weighting: The choice of noise schedule and of loss weighting (ELBO-derived or constant) significantly impacts model stability and performance. The latent channel typically follows a variance-preserving (VP) cosine schedule.
- Training and Sampling Procedures: To prevent latent collapse, training proceeds in two stages: the discrete channel is optimized first, and the weight of the latent channel is increased gradually.
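The schedule and staging above can be sketched as follows. The cosine alpha-bar parameterization is the standard VP cosine schedule, and the linear warmup ramp is a hypothetical choice for illustration; the paper's exact schedule and ramp may differ:

```python
import math

def vp_cosine_alpha_bar(t, s=0.008):
    """Standard VP cosine schedule: t in [0, 1], alpha_bar(0) ~ 1, alpha_bar(1) ~ 0."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def noisy_latent(z0, eps, t):
    """VP forward process: z_t = sqrt(ab) * z0 + sqrt(1 - ab) * eps."""
    ab = vp_cosine_alpha_bar(t)
    return math.sqrt(ab) * z0 + math.sqrt(1 - ab) * eps

def latent_weight(step, warmup=10_000, ramp=10_000):
    """Hypothetical two-stage weighting: discrete-only warmup, then a
    linear ramp of the latent-loss weight up to 1.0 (to avoid latent collapse)."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)
```

During warmup only the discrete cross-entropy trains; afterwards the latent term is phased in, so the latent channel cannot be ignored once the token denoiser has a reasonable initialization.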
For sampling, both deterministic and stochastic procedures can be utilized, with DDIM-style deterministic sampling potentially offering smoother generation paths.
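On the latent channel, the deterministic option can be sketched with the standard eta = 0 DDIM step; how this step couples to the token channel in the full sampler is omitted here:

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_s):
    """Standard deterministic (eta = 0) DDIM update on the latent channel,
    moving from noise level t to a less-noisy level s."""
    # Recover the predicted clean latent from the noise prediction.
    z0_pred = (z_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Deterministic move to level s: no fresh noise is injected.
    return np.sqrt(alpha_bar_s) * z0_pred + np.sqrt(1 - alpha_bar_s) * eps_pred
```

Because no fresh noise is injected, repeated application traces a smooth path from the Gaussian prior to a clean latent, which is the "smoother generation path" behavior noted above.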
Evaluation and Results
The paper evaluates LDDMs using tasks that range from controlled synthetic datasets to large-scale language modeling on the LM1B dataset. Key findings include:
- Synthetic Tasks: SEQ-LDDMs effectively capture conditionally factorized distributions with reduced steps on the data channel, demonstrating the advantage of leveraging latent conditioning.
- Language Modeling (LM1B): FUJI-LDDMs with a pre-trained latent encoder achieve lower generative perplexity and maintain similar entropy levels compared to state-of-the-art masked discrete diffusion models. This indicates improved few-step generation capability without significant entropic loss.
Conclusion
Latent Discrete Diffusion Models present a refined approach to discrete data generation, leveraging continuous latents to unify the strengths of discrete and continuous diffusion processes. By addressing the inherent limitations of factorized reverse transitions, LDDMs represent a promising direction for efficient and scalable generative modeling across categorical data applications. Future work may explore improved encoder training schemes and adaptive noise scheduling to further enhance model performance.