
Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Published 23 Oct 2024 in cs.CL (arXiv:2410.17891v3)

Abstract: Diffusion language models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions: https://github.com/HKUNLP/DiffuLLaMA.


Summary

  • The paper introduces an adaptation technique that converts pre-trained AR models into diffusion language models using attention mask annealing and shift operations.
  • It scales models from 127M to 7B parameters with less than 200B tokens, achieving competitive performance in language modeling, reasoning, and commonsense benchmarks.
  • The findings demonstrate that diffusion models enable parallel, any-order text generation, offering practical advantages over traditional sequential AR models.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

The paper "Scaling Diffusion Language Models via Adaptation from Autoregressive Models" presents a methodology for scaling diffusion language models (DLMs) by leveraging pre-trained autoregressive (AR) language models. Diffusion models offer a promising alternative to AR models by enabling parallel, any-order text generation, which could address some limitations inherent in the strictly sequential nature of AR decoding. Until now, however, the scaling of DLMs has been limited by computational challenges and the absence of optimization techniques comparable to those available for AR models.
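The parallel, any-order decoding mentioned above can be illustrated with a minimal sketch of absorbing-state diffusion sampling: generation starts from a fully masked sequence and unmasks several positions per denoising round. This is an illustrative toy, not the paper's exact sampler; `predict_fn` and the uniform unmasking schedule are placeholder assumptions.

```python
import random

MASK = "[MASK]"

def diffusion_decode(predict_fn, seq_len, num_steps):
    """Toy any-order decoder: start fully masked, reveal tokens in rounds.

    predict_fn maps the current (partially masked) sequence to one
    predicted token per position, standing in for a denoising network.
    """
    seq = [MASK] * seq_len
    masked = list(range(seq_len))
    per_step = max(1, seq_len // num_steps)
    for _ in range(num_steps):
        if not masked:
            break
        preds = predict_fn(seq)                     # one prediction per position
        chosen = random.sample(masked, min(per_step, len(masked)))
        for i in chosen:                            # fill several positions at once
            seq[i] = preds[i]
        masked = [i for i in masked if i not in chosen]
    for i in masked:                                # final cleanup pass, if any
        seq[i] = predict_fn(seq)[i]
    return seq
```

Because each round fills positions anywhere in the sequence, the same loop naturally supports infilling: fixing some positions in advance and letting the sampler complete only the masked ones.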

Key Contributions

The authors propose an approach to adapt existing AR models into DLMs, bridging the objective differences between the two paradigms. They introduce a technique involving attention mask annealing and a shift operation to transform AR model architectures, enabling them to function effectively as diffusion models. This approach allows the conversion of models such as GPT2 and LLaMA2, scaling from 127M to 7B parameters, using less than 200 billion tokens of continual pre-training.
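The two adaptation ingredients can be sketched in a few lines. This is a simplified illustration of the ideas, not the paper's exact recipe: the band-by-band annealing schedule and the zeroed first row of the shift are assumptions made for the sketch.

```python
import numpy as np

def annealed_attention_mask(seq_len, step, total_steps):
    """Interpolate from a causal mask toward full bidirectional attention.

    At step 0 the mask is purely causal (as in the AR model); by
    total_steps every token may also attend to all future positions,
    as a diffusion denoiser requires. Here we widen a band of visible
    future positions linearly over training (a hypothetical schedule).
    """
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    window = int(round((seq_len - 1) * min(step / total_steps, 1.0)))
    above = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    too_far = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=window + 1)
    return causal | (above & ~too_far)

def shift_for_diffusion(ar_logits):
    """Align AR next-token logits with diffusion's same-position targets.

    ar_logits has shape (seq_len, vocab); in an AR model, row i predicts
    token i+1. A diffusion denoiser instead wants row i to predict token
    i, so shift the rows right by one. Row 0 has no AR prediction and is
    zeroed here as a placeholder.
    """
    shifted = np.zeros_like(ar_logits)
    shifted[1:] = ar_logits[:-1]
    return shifted
```

Annealing the mask rather than switching it abruptly lets the adapted model retain the AR weights' learned behavior while gradually acquiring bidirectional context.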

Numerical Results and Benchmarks

The experimental analysis shows that the adapted DLMs not only exceed the performance of previous, smaller diffusion models but also remain competitive with their AR counterparts across a variety of tasks. Notably, the DiffuGPT models surpass GPT2 on language modeling, reasoning, and commonsense understanding benchmarks. The paper also highlights advantages of DLMs on global reasoning tasks, such as mathematics and coding, where strictly left-to-right AR generation can be limiting. These findings establish that scaling DLMs through adaptation is both feasible and beneficial.

Implications and Future Directions

The successful adaptation and scaling of DLMs from AR models open significant pathways for future research in natural language processing. This work suggests that the constraints of sequential AR models might be overcome, ushering in models capable of parallel, flexible, and contextually dynamic text generation. Moreover, the computational efficiency, as indicated by the reduced latency of DiffuLLaMA for long-sequence generation, presents a practical advantage for deploying large-scale LLMs in real-world applications.

Future work could explore instruction tuning and inference-time planning methodologies to further harness the inherent capabilities of DLMs, potentially surpassing the limitations of AR models on more comprehensive and dynamic NLP tasks. Additionally, refining the fine-tuning and sampling techniques could enhance the adaptability of DLMs across diverse domains and contexts.

Overall, this paper contributes a substantial advancement in AI research, highlighting the potential of DLMs as a viable alternative to AR language modeling by leveraging existing computational investments in pre-trained models. Such innovations will likely catalyze continued exploration and optimization of diffusion-based frameworks in text generation tasks.
