- The paper introduces a novel hybrid approach combining masked discrete diffusion with continuous latents to capture joint dependencies and improve generation quality.
- It presents two variants (FUJI-LDDMs and SEQ-LDDMs) with tailored denoising steps, achieving lower perplexity on benchmarks like LM1B.
- The framework employs ELBO-based training and careful noise scheduling to address the factorization problem of traditional discrete diffusion models.
Latent Discrete Diffusion Models
Introduction
The paper introduces Latent Discrete Diffusion Models (LDDMs) to overcome the limitations of masked denoisers in discrete diffusion, particularly for categorical data such as language. Traditional discrete diffusion models, while effective for generation tasks, often rely on factorized reverse transitions that disregard joint dependencies across tokens, degrading performance in few-step generation.
LDDMs propose a hybrid approach, coupling a masked discrete diffusion process with a continuous diffusion over latent embeddings to capture cross-token dependencies. This dual-channel approach aims to provide a softer, more informative signal for learning and inference, thereby improving coherence and quality in generated outputs.
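As a toy illustration of the factorization problem, consider a joint distribution over two tokens that must agree. The per-position marginals are uniform, so a denoiser that unmasks both positions independently in a single step loses the correlation. The setup below is a constructed example, not taken from the paper:

```python
import numpy as np

# Hypothetical toy joint: P("a a") = P("b b") = 0.5, so tokens must match.
# The per-position marginals are uniform over {"a", "b"}. A factorized
# one-step denoiser samples each position independently from its marginal,
# producing mismatched pairs about half the time.
rng = np.random.default_rng(0)
vocab = np.array(["a", "b"])
marginal = np.array([0.5, 0.5])  # per-position marginal of the joint

n = 10_000
tok0 = rng.choice(vocab, size=n, p=marginal)
tok1 = rng.choice(vocab, size=n, p=marginal)  # sampled independently of tok0
match_rate = float(np.mean(tok0 == tok1))
print(f"factorized one-step match rate: {match_rate:.2f}")
```

Under the true joint the match rate would be 1.0; independent per-position sampling drives it toward 0.5, which is exactly the degradation the latent channel is meant to repair.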
Latent Discrete Diffusion Models (LDDMs) Framework
The proposed LDDMs enhance masked discrete diffusion by incorporating a continuous latent space. The framework operates with three main components: a latent encoder, a forward noising process, and a denoising process, implemented in two core variants:
- FUJI-LDDMs: FUlly JoInt denoising models that simultaneously evolve both token and latent channels with shared interactions at each denoising step.
- SEQ-LDDMs: SEQuential denoising models that first resolve the continuous latent channel and subsequently condition the discrete chain on this latent representation, allowing each channel to be optimized for specific roles in the denoising process.
Both variants are trained with Evidence Lower Bound (ELBO)-style objectives, accommodating the distinct learning dynamics that arise from combining discrete and continuous processes.
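The combined objective can be sketched as a weighted sum of a masked cross-entropy term for the token channel and a regression term for the latent channel. The function signature and the weights `w_disc`/`w_lat` below are illustrative placeholders, not the paper's exact parameterization:

```python
import numpy as np

def lddm_loss(token_logits, targets, mask, latent_pred, latent_target,
              w_disc=1.0, w_lat=1.0):
    """Hedged sketch of an ELBO-style combined objective for LDDMs."""
    # Discrete channel: cross-entropy over masked positions only,
    # as in masked discrete diffusion ELBOs.
    logits = token_logits[mask]                                  # (M, V)
    m = logits.max(axis=-1, keepdims=True)                       # for stability
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    ce = -logp[np.arange(logits.shape[0]), targets[mask]].mean()
    # Continuous channel: regression to the clean latent (x0-prediction),
    # the usual simplification of the Gaussian ELBO term.
    mse = np.mean((latent_pred - latent_target) ** 2)
    return w_disc * ce + w_lat * mse
```

Time-dependent ELBO weights would replace the constants `w_disc` and `w_lat`; with constant weights this reduces to the "constant loss weighting" option mentioned below.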
Implementation Details
LDDMs require careful design of the following components:
- Latent Encoder: Can be pre-trained or custom-trained to extract informative latent representations from tokens. The encoder facilitates the learning of global dependencies across tokens, addressing the factorization issue in traditional masked discrete diffusions.
- Noise Scheduling and Loss Weighting: The choice of noise schedule and of loss weighting (ELBO-derived or constant) significantly impacts model stability and performance. The latent channel typically follows a variance-preserving (VP) cosine schedule.
- Training and Sampling Procedures: To prevent latent collapse, training proceeds in two stages: the discrete channel is optimized first, and the weight of the latent channel is increased gradually.
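The schedule and staging above can be sketched as follows. The cosine alpha-bar parameterization is the standard VP cosine schedule, and the linear warmup ramp is a hypothetical choice for illustration; the paper's exact schedule and ramp may differ:

```python
import math

def vp_cosine_alpha_bar(t, s=0.008):
    """Standard VP cosine schedule: t in [0, 1], alpha_bar(0) ~ 1, alpha_bar(1) ~ 0."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def noisy_latent(z0, eps, t):
    """VP forward process: z_t = sqrt(ab) * z0 + sqrt(1 - ab) * eps."""
    ab = vp_cosine_alpha_bar(t)
    return math.sqrt(ab) * z0 + math.sqrt(1 - ab) * eps

def latent_weight(step, warmup=10_000, ramp=10_000):
    """Hypothetical two-stage weighting: discrete-only warmup, then a
    linear ramp of the latent-loss weight up to 1.0 (to avoid latent collapse)."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / ramp)
```

During warmup only the discrete cross-entropy trains; afterwards the latent term is phased in, so the latent channel cannot be ignored once the token denoiser has a reasonable initialization.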
For sampling, both deterministic and stochastic procedures can be utilized, with DDIM-style deterministic sampling potentially offering smoother generation paths.
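On the latent channel, the deterministic option can be sketched with the standard eta = 0 DDIM step; how this step couples to the token channel in the full sampler is omitted here:

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_s):
    """Standard deterministic (eta = 0) DDIM update on the latent channel,
    moving from noise level t to a less-noisy level s."""
    # Recover the predicted clean latent from the noise prediction.
    z0_pred = (z_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Deterministic move to level s: no fresh noise is injected.
    return np.sqrt(alpha_bar_s) * z0_pred + np.sqrt(1 - alpha_bar_s) * eps_pred
```

Because no fresh noise is injected, repeated application traces a smooth path from the Gaussian prior to a clean latent, which is the "smoother generation path" behavior noted above.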
Evaluation and Results
The paper evaluates LDDMs using tasks that range from controlled synthetic datasets to large-scale language modeling on the LM1B dataset. Key findings include:
- Synthetic Tasks: SEQ-LDDMs effectively capture conditionally factorized distributions with reduced steps on the data channel, demonstrating the advantage of leveraging latent conditioning.
- Language Modeling (LM1B): FUJI-LDDMs with a pre-trained latent encoder achieve lower generative perplexity and maintain similar entropy levels compared to state-of-the-art masked discrete diffusion models. This indicates improved few-step generation capability without significant entropic loss.
Conclusion
Latent Discrete Diffusion Models present a refined approach to discrete data generation, leveraging continuous latents to unify the strengths of discrete and continuous diffusion processes. By addressing the inherent limitations of factorized reverse transitions, LDDMs represent a promising direction for efficient and scalable generative modeling across categorical data applications. Future work may explore improved encoder training schemes and adaptive noise scheduling to further enhance model performance.