Mercury: Ultra-Fast Language Models Based on Diffusion

Published 17 Jun 2025 in cs.CL, cs.AI, and cs.LG | (2506.17298v1)

Abstract: We present Mercury, a new generation of commercial-scale LLMs based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai

Abstract PDF Upgrade to Chat

Summary

The paper introduces Mercury models that leverage diffusion processes to achieve ultra-fast performance in language modeling.
It employs a Transformer architecture with dynamic refinement over trillions of tokens, optimizing speed and quality on coding tasks.
Benchmark results show Mercury models outperform traditional autoregressive LLMs in throughput and latency while maintaining competitive quality.

Mercury: Ultra-Fast LLMs Based on Diffusion

The paper "Mercury: Ultra-Fast LLMs Based on Diffusion" introduces the Mercury family of diffusion-based LLMs (dLLMs) designed to significantly enhance the speed-quality trade-off in LLM implementations, with a particular emphasis on coding applications.

Introduction to Mercury Models

The Mercury models are built upon diffusion processes rather than traditional autoregressive methods, providing notable advantages in parallel token generation, speed, and efficiency. The adoption of diffusion models, traditionally successful in image and video generation, marks a novel approach to addressing the challenges of speed and control in LLMs for text. Mercury Coder models, specifically, implement this approach to enhance coding tasks, achieving higher throughput on standard GPU hardware like the NVIDIA H100.

Figure 1: Quality vs. Speed Trade-offs for Mercury Coder models, demonstrating superior throughput while maintaining quality.

Architecture and Training Details

The Mercury models leverage a Transformer architecture, optimized for diffusion processes. They engage in a dynamic refinement method, transforming random noise into data-consistent samples through learned denoising steps. This process enables significant speed improvements by utilizing high parallelization capabilities, thus elevating computational efficiency and arithmetic intensity.

The training regimen involves trillions of tokens collected from web crawls and proprietary datasets, conducted on large clusters of NVIDIA H100 GPUs. The use of diffusion provides fine-grained control over token transformations and allows for novel alignment and tuning methods, enhancing functionality in diverse generative tasks.

Performance Evaluation

Benchmark Comparisons

The Mercury models are evaluated on various standard code generation benchmarks, including HumanEval, MBPP, and MultiPL-E. These benchmarks assess code correctness, inference speed, and multi-language generation capabilities.

The Mercury Coder Mini, despite being a smaller model, achieves far superior throughput compared to open-weight models and even surpasses many established speed-optimized frontier models. This performance is coupled with maintaining competitive quality, as seen in numerous benchmarks across diverse programming languages.

Real World Implications

In practical applications, particularly those requiring rapid execution such as auto-completion and code snippets, Mercury models drastically improve latency and responsiveness. The advancement extends to scenarios such as Copilot Arena, where Mercury models are recognized for their low latency and user-preference in code completion tasks.

Adaptation and Inference

Mercury models support a range of generation modalities, such as zero-shot and few-shot prompting, and are fully adaptable to existing LLM use-cases with backward compatibility to existing APIs. This feature enables smooth transitions from traditional autoregressive models to Mercury's diffusion models, providing new opportunities for rapid iteration and deployment in latency-sensitive environments.

Implications and Future Directions

Mercury models embody significant advancements in diffusion-based LLMs, pushing the boundaries of what can be achieved in terms of speed without sacrificing quality. The diffusion approach allows for fine-tuned control and adaptability across different tasks, opening new research avenues in efficient, scalable AI implementations across diverse domains.

Conclusion

The Mercury models represent a significant step forward in LLM development by exploiting the strengths of diffusion processes. Their superior speed efficiency coupled with high-quality outputs in code-centric applications demonstrates their potential as robust and viable alternatives to current state-of-the-art autoregressive models. Future developments will likely focus on further scaling and adapting these models across more complex and diverse generative tasks.

Markdown Report Issue