One Token Is Enough: Improving Diffusion Language Models with a Sink Token

Published 27 Jan 2026 in cs.CL and cs.AI | (2601.19657v2)

Abstract: Diffusion LLMs (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.


Summary

The paper introduces a stable sink token that anchors shifting attention sinks in Diffusion Language Models, enhancing model robustness.
It adds one extra token to the input sequence, governed by a modified attention mask, to absorb excess attention without semantic interference.
Experimental results demonstrate significant performance improvements, especially in intermediate transformer layers.

Enhancing Diffusion LLMs with a Stable Sink Token

Introduction

The paper "One Token Is Enough: Improving Diffusion Language Models with a Sink Token" (2601.19657) addresses an instability in Diffusion LLMs (DLMs), which are increasingly used for parallel text generation. The instability is the moving sink phenomenon: the positions of attention sinks shift erratically across diffusion timesteps, which undermines inference robustness. Unlike autoregressive models, whose structure inherently stabilizes the attention distribution, DLMs lack a consistent mechanism for anchoring attention. The paper proposes a single stable sink token to remove this instability and improve model performance.

Moving Sink Phenomenon

In DLMs, attention tends to concentrate on tokens whose representations have low norm in the Transformer's value space. These tokens, which often carry little semantic content, act as implicit sinks. Their positions shift from one diffusion step to the next (Figure 1). Without a fixed anchor for attention, inference becomes less robust, in sharp contrast to autoregressive models, where a consistent initial token stabilizes the attention distribution.

Figure 1: Moving attention sinks and their low-norm property in diffusion LLMs. Attention maps demonstrate shifting sink positions.
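
One way to make the moving sink concrete is to locate, at each diffusion step, the key position that receives the most attention on average. The sketch below is illustrative only: the helper name `sink_position` and the toy attention matrices are assumptions, not taken from the paper, and the paper may identify sinks with a different criterion.

```python
def sink_position(attn):
    """Return the key index that receives the largest mean attention.

    attn[q][k] is the attention weight from query q to key k,
    with each row summing to 1 (hypothetical toy input).
    """
    num_queries = len(attn)
    num_keys = len(attn[0])
    # Average attention received by each key across all queries.
    received = [sum(row[k] for row in attn) / num_queries for k in range(num_keys)]
    return max(range(num_keys), key=received.__getitem__)


# Two made-up attention maps standing in for two diffusion steps.
step_a = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1],
          [0.8, 0.1, 0.1]]
step_b = [[0.1, 0.2, 0.7],
          [0.1, 0.3, 0.6],
          [0.1, 0.1, 0.8]]

positions = [sink_position(step_a), sink_position(step_b)]
```

In this toy example the sink sits at a different position in each step, which is the kind of shift the moving sink phenomenon refers to.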

Methodology

The authors add one extra sink token to the input sequence and implement it through a modified attention mask: the token attends only to itself, while remaining visible to every other token. It is designed to absorb excess attention while carrying negligible semantic content, so it acts as a purely structural anchor. This turns the moving sink into a stable one (Figure 2). The paper reports that the benefit does not depend on where the token is placed in the sequence, an advantage over implicit sinks whose positions vary.

Figure 2: Overview of the moving-sink phenomenon and the implementation of a stable sink token.
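
The abstract describes the sink token as attending solely to itself while remaining globally visible to all other tokens. A minimal sketch of such a mask follows, assuming full bidirectional attention among the regular tokens. The function names, the default `sink_index=0`, and the pure-Python softmax are illustrative assumptions, not the authors' implementation.

```python
import math


def build_sink_mask(num_tokens, sink_index=0):
    """Boolean mask for num_tokens regular tokens plus one extra sink token.

    mask[q][k] is True when query position q may attend to key position k.
    """
    total = num_tokens + 1
    if not 0 <= sink_index < total:
        raise ValueError("sink_index out of range")
    # Regular tokens see every position, including the sink.
    mask = [[True] * total for _ in range(total)]
    # The sink token attends only to itself.
    mask[sink_index] = [k == sink_index for k in range(total)]
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed keys."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    denom = sum(exps)
    return [e / denom for e in exps]


mask = build_sink_mask(num_tokens=3)
sink_row = masked_softmax([0.5, 2.0, -1.0, 0.3], mask[0])     # sink query
regular_row = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])   # regular query
```

Here the sink query places all of its weight on itself, while a regular query can spread weight over every position, the sink included. Because the sink never reads from other tokens, its representation cannot pick up content from the sequence.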

Experimental Results

The paper's experiments report substantial improvements across various benchmarks. Models equipped with the proposed sink token consistently outperform their vanilla DLM counterparts, which the authors attribute to the stabilized attention sink. The effect is particularly pronounced in intermediate transformer layers, where the sink token reliably absorbs attention without semantic interference (Figure 3).

Figure 3: Mean $L_2$ Norm per Layer: Sink vs Others (Model: 0.5B) demonstrating the low-norm characteristic of sink tokens.
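
The comparison behind Figure 3 can be sketched for a single layer as follows. This is a hedged illustration: the helper `sink_vs_others_norm` and the toy value vectors are assumptions, and the paper's actual measurement procedure may differ.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def sink_vs_others_norm(values, sink_index):
    """Return (sink norm, mean norm of all other tokens) for one layer.

    values[i] is the value-space representation of token i (toy data here).
    """
    norms = [l2_norm(v) for v in values]
    others = [n for i, n in enumerate(norms) if i != sink_index]
    return norms[sink_index], sum(others) / len(others)


# Made-up value vectors: token 0 plays the role of the low-norm sink.
layer_values = [[0.0, 0.1], [3.0, 4.0], [6.0, 8.0]]
sink_norm, others_norm = sink_vs_others_norm(layer_values, sink_index=0)
```

Repeating this per layer and plotting both numbers would give a curve pair of the kind the caption describes.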

Internal Analysis

Further analysis shows that the sink token attracts a significant proportion of attention mass regardless of where it is placed in the sequence. Trials with a zero-vector sink token yield similar performance benefits, supporting the hypothesis that the advantage comes from offering a low-norm target that limits information mixing, not from any semantic content the token carries.
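
A short worked bound illustrates why a low-norm sink limits information mixing. This is a sketch under assumed notation, not a derivation from the paper: $a_{ij}$ is the attention weight from query $i$ to key $j$, $v_j$ is the value vector of token $j$, and $s$ is the sink position.

```latex
\begin{aligned}
o_i &= \sum_{j} a_{ij}\, v_j
     = a_{is}\, v_s + \sum_{j \neq s} a_{ij}\, v_j,
     \qquad \sum_{j \neq s} a_{ij} = 1 - a_{is}, \\
\lVert o_i \rVert
    &\le a_{is}\, \lVert v_s \rVert
       + (1 - a_{is}) \max_{j \neq s} \lVert v_j \rVert .
\end{aligned}
```

When $\lVert v_s \rVert \approx 0$, as with a zero-vector sink, the first term vanishes, so the more attention a query sends to the sink, the smaller the contribution it can mix in from the other tokens.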

Conclusion

Introducing a stable sink token into Diffusion LLMs offers a practical remedy for the attention instabilities inherent in current DLM designs. By anchoring attention at a fixed token with minimal semantic interference, the proposed method improves model robustness and performance. The result may inform future DLM designs aimed at more reliable parallel text generation.