Energy-Based Diffusion Language Models for Text Generation
Abstract: Despite remarkable progress in autoregressive LLMs, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with their capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform their autoregressive counterparts, and the performance gap widens as the number of sampling steps is reduced. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose the Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full-sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in residual form, and show that its parameters can be obtained either by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel importance sampling. Comprehensive experiments on language modeling benchmarks show that our model consistently outperforms state-of-the-art diffusion models by a significant margin and approaches the perplexity of autoregressive models. We further show that, without any drop in generation quality, our framework offers a 1.3$\times$ sampling speedup over existing diffusion models. Code is available at https://github.com/MinkaiXu/Energy-Diffusion-LLM.
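The generation algorithm the abstract mentions relies on importance sampling: candidates drawn in parallel from the diffusion model's proposal are reweighted by the residual energy. The sketch below illustrates the standard self-normalized importance-resampling step under that framing; the `energy` function and all names are hypothetical stand-ins (the paper's actual energy comes from a pretrained autoregressive model or an NCE-finetuned bidirectional transformer), so this is a minimal illustration of the mechanism, not the authors' implementation.

```python
import math
import random

def energy(seq):
    # Hypothetical residual energy: lower energy = more plausible sequence.
    # Stand-in for a learned sequence-level energy head.
    return 0.1 * sum(seq)

def parallel_importance_resample(proposal_samples, energy_fn, rng=random):
    """Self-normalized importance resampling over parallel proposals.

    Each candidate x drawn from the diffusion proposal gets weight
    proportional to exp(-E(x)); one candidate is then resampled
    according to the normalized weights.
    """
    energies = [energy_fn(x) for x in proposal_samples]
    m = min(energies)  # subtract the min for numerical stability
    weights = [math.exp(-(e - m)) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(proposal_samples, weights=probs, k=1)[0]
```

With a sufficiently large energy gap between candidates, the low-energy sequence is selected with probability close to one, which is how the energy corrects the factorized (imperfect) denoising approximation at each diffusion step.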