Abstract: The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic compute and linear memory complexity, however, remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances the efficiency of linear attention with the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures -- Mamba-2 and Gated DeltaNet -- and find they perform well compared to their linear-time counterparts.
The paper introduces log-linear attention, which replaces fixed hidden states with logarithmically growing states via Fenwick tree partitioning, achieving O(T log T) compute and O(log T) memory.
It demonstrates improved performance on long-range dependency tasks, with enhanced accuracy in synthetic benchmarks, language modeling, and retrieval settings.
The method leverages efficient parallel computation with custom Triton kernels, proving beneficial in architectures like Log-Linear Mamba-2 and Gated DeltaNet.
Log-Linear Attention: A Hierarchical Extension of Linear Attention for Efficient Long-Context Modeling
Motivation and Background
The quadratic compute and linear memory complexity of standard softmax attention in Transformers remains a significant bottleneck for long-context sequence modeling, despite advances in hardware-optimized implementations. Linear attention and state-space models (SSMs) address this by replacing the softmax kernel with a linear kernel, enabling linear-time, constant-memory sequence modeling and efficient parallelization via chunking. However, these models fundamentally rely on a fixed-size hidden state, which limits their expressiveness—particularly for tasks requiring associative recall or long-range dependency tracking. Empirically, linear attention variants often degrade in performance as context length increases, despite competitive results in short-context settings.
Log-Linear Attention: Core Concept
Log-linear attention is introduced as a principled middle ground between linear and softmax attention. The key innovation is to replace the fixed-size hidden state of linear attention with a set of hidden states whose size grows logarithmically with sequence length. This is achieved by partitioning the input sequence into hierarchical, power-of-two-sized segments using a Fenwick tree (binary indexed tree) structure. Each query attends to a logarithmic number of hidden states, each summarizing a different temporal scale of the past context. This design enables:
O(T log T) compute and O(log T) memory during inference, with efficient parallelization across sequence length during training.
Multi-scale context modeling, where recent tokens are represented at finer granularity and distant tokens are compressed, reflecting a hierarchical inductive bias.
The log-linear attention mechanism is general and can be applied to any linear attention variant whose masking matrix admits efficient chunkwise-parallel computation.
Formalization and Algorithmic Structure
Fenwick Tree Partitioning
For each query at position t, the prefix [0, t) is partitioned into up to L = ⌈log₂ t⌉ + 1 disjoint buckets, each corresponding to a different temporal scale. The partitioning is guided by the least significant set bit (LSSB) in the binary representation of t, ensuring that recent tokens are assigned to finer buckets.
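The LSSB-driven decomposition can be sketched in a few lines. This is an illustrative helper (the function name and return convention are assumptions, not the paper's API): it repeatedly peels the least significant set bit off the remaining prefix length, so the most recent tokens fall into the smallest buckets.

```python
def fenwick_buckets(t):
    """Partition the prefix [0, t) into disjoint power-of-two buckets.

    Peels off the least significant set bit of the remaining prefix
    length at each step; returns (lo, hi) half-open ranges with the
    finest (most recent) bucket first.
    """
    buckets = []
    hi = t
    while hi > 0:
        size = hi & -hi          # least significant set bit of hi
        buckets.append((hi - size, hi))
        hi -= size
    return buckets
```

For example, t = 13 = 1101₂ yields buckets of sizes 1, 4, and 8: [12, 13), [8, 12), [0, 8). The number of buckets equals the popcount of t, which is at most ⌈log₂ t⌉ + 1.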
The output at position t is computed as:
$$h_t = \sum_{\ell=0}^{L-1} \lambda_t^{(\ell)} \, q_t^\top \sum_{s \in B_t^{(\ell)}} k_s v_s^\top$$
where $B_t^{(\ell)}$ is the $\ell$-th bucket of the Fenwick partition of $[0, t)$, and $\lambda_t^{(\ell)}$ are data-dependent, per-level weights parameterized as functions of the input.
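A naive reference implementation of this formula makes the structure concrete. The shapes and the level convention (level 0 = finest, most recent bucket) are illustrative assumptions; this is an O(T²) sketch, not the paper's efficient kernel.

```python
import numpy as np

def log_linear_attention_naive(q, k, v, lam):
    """Naive reference for the bucketed output formula.

    q, k: (T, d); v: (T, d_v); lam: (T, L) per-level weights, with
    level 0 indexing the finest (most recent) bucket.
    """
    T = q.shape[0]
    out = np.zeros((T, v.shape[1]))
    for t in range(1, T):
        hi, lvl = t, 0
        while hi > 0:                      # Fenwick decomposition of [0, t)
            size = hi & -hi                # least significant set bit
            lo = hi - size
            S = k[lo:hi].T @ v[lo:hi]      # bucket state: sum_s k_s v_s^T
            out[t] += lam[t, lvl] * (q[t] @ S)
            hi, lvl = lo, lvl + 1
    return out
```

A useful sanity check: with all λ set to 1 the buckets tile the entire prefix, so the output reduces to unnormalized causal linear attention, q_t applied to the full running sum of k_s v_sᵀ.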
Parallel Formulation
To leverage hardware acceleration, the computation is reformulated as a matmul-friendly operation:
$$O = \left( QK^\top \odot H \right) V$$
where $H$ is a lower-triangular, hierarchical mask matrix encoding the Fenwick tree partitioning and per-level weights. This matrix exhibits a quasi-hierarchical (HODLR-type) structure, enabling efficient O(T log T) parallel algorithms.
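To make the masked matmul form concrete, here is a dense sketch that materializes H explicitly. In practice the whole point is that the HODLR-type structure lets the algorithm avoid materializing H; the function name and dense construction are illustrative assumptions.

```python
import numpy as np

def hierarchical_mask(T, lam):
    """Strictly lower-triangular mask with H[t, s] = lam[t, level], where
    `level` is the Fenwick bucket of position s within the prefix [0, t)
    (level 0 = finest). Dense O(T^2) construction, for illustration only.
    """
    H = np.zeros((T, T))
    for t in range(1, T):
        hi, lvl = t, 0
        while hi > 0:
            size = hi & -hi
            H[t, hi - size:hi] = lam[t, lvl]
            hi -= size
            lvl += 1
    return H

def log_linear_attention_matmul(q, k, v, H):
    """Masked matmul form of the output: O = (Q K^T ⊙ H) V."""
    return ((q @ k.T) * H) @ v
```

With all λ equal to 1, H collapses to the strict causal mask, recovering plain (unnormalized) linear attention as a special case.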
Memory-Efficient Decoding
During autoregressive decoding, only O(logT) hidden states need to be maintained, each corresponding to a hierarchical bucket. The update rule for the hidden states at each timestep is derived from the Fenwick tree recurrence, ensuring logarithmic memory and update time.
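The state maintenance during decoding can be sketched as a binary-counter-style merge: when a new token arrives, its rank-one state merges with any existing bucket of equal size, exactly like carries in a Fenwick tree update. This is a simplified sketch; the paper's actual recurrence additionally folds in the gating/decay terms of the underlying linear attention variant.

```python
import numpy as np

class FenwickStates:
    """Maintain the O(log T) bucket states S = sum_{s in bucket} k_s v_s^T.

    Appending a token merges equal-size buckets like the carries of a
    binary counter, mirroring a Fenwick (binary indexed) tree update.
    After t appends, the bucket sizes are the binary decomposition of t.
    """
    def __init__(self):
        self.levels = []                    # (size, state) pairs, coarsest first

    def append(self, k, v):
        size, state = 1, np.outer(k, v)     # rank-one state for the new token
        while self.levels and self.levels[-1][0] == size:
            prev_size, prev_state = self.levels.pop()
            size += prev_size               # merge two equal-size buckets
            state = state + prev_state
        self.levels.append((size, state))
```

At each step the query then combines the at most ⌈log₂ t⌉ + 1 live states with their per-level λ weights, giving logarithmic memory and update time.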
Chunkwise Parallel Training
Training employs a chunkwise-parallel scan algorithm, decomposing the hierarchical matrix into intra-chunk (dense, quadratic in chunk size) and inter-chunk (hierarchical, logarithmic in number of chunks) components. This enables efficient parallelization and hardware utilization, with a custom Triton kernel implementation outperforming FlashAttention-2 at long sequence lengths.
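The intra-/inter-chunk split can be illustrated on the dense masked-matmul form. This naive version is still O(T²) overall (it applies the off-diagonal blocks one chunk at a time); the actual algorithm batches the inter-chunk blocks by Fenwick level to reach O(T log T). Function name and signature are illustrative assumptions.

```python
import numpy as np

def chunkwise_masked_attention(q, k, v, H, C):
    """Sketch of the intra-/inter-chunk decomposition of O = (Q K^T ⊙ H) V.

    Within each C-token chunk the masked scores are applied densely
    (quadratic in C); contributions from all earlier chunks are added
    through the off-diagonal blocks of H.
    """
    T = q.shape[0]
    out = np.zeros((T, v.shape[1]))
    for c0 in range(0, T, C):
        c1 = min(c0 + C, T)
        # intra-chunk: dense masked matmul inside the chunk
        out[c0:c1] = ((q[c0:c1] @ k[c0:c1].T) * H[c0:c1, c0:c1]) @ v[c0:c1]
        # inter-chunk: masked contributions from all preceding chunks
        out[c0:c1] += ((q[c0:c1] @ k[:c0].T) * H[c0:c1, :c0]) @ v[:c0]
    return out
```

The decomposition is exact: summing the intra- and inter-chunk terms reproduces the full masked product for any mask H, which is what lets the hierarchical structure of the inter-chunk blocks be exploited independently of the dense intra-chunk computation.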
Application to Mamba-2 and Gated DeltaNet
Log-linear attention is instantiated in two recent architectures:
Log-Linear Mamba-2: Extends Mamba-2 by composing its sequentially semi-separable (SSS) attention mask with the log-linear hierarchical mask, preserving the original gating structure.
Log-Linear Gated DeltaNet: Similarly extends Gated DeltaNet, which uses identity-plus-rank-one transition matrices, by applying the log-linear hierarchical mask.
Both variants inherit the efficient chunkwise-parallel primitives of their linear counterparts, with only minor parameter overhead (e.g., <3% for Mamba-2).
Empirical Results
Synthetic Benchmarks
On the multi-query associative recall (MQAR) task, log-linear DeltaNet maintains high accuracy as sequence length increases, while standard DeltaNet degrades significantly. Softmax attention remains the upper bound.
Language Modeling
On academic-scale language modeling (50B tokens, 16K context), log-linear variants of Mamba-2 and Gated DeltaNet match or outperform their linear counterparts in perplexity and zero-shot reasoning tasks, with Gated DeltaNet showing more consistent gains.
Long-Context and Retrieval Benchmarks
Per-position loss: Log-linear variants exhibit lower loss across token positions, indicating improved long-range context utilization.
Needle-In-A-Haystack (NIAH): Log-linear Mamba-2 and Gated DeltaNet outperform their linear baselines on most retrieval metrics, especially in multi-needle settings.
In-context retrieval and LongBench: Log-linear Gated DeltaNet consistently matches or outperforms its linear baseline across a range of real-world, recall-intensive tasks.
Throughput and Implementation
Custom Triton kernels for log-linear Mamba-2 achieve higher throughput than FlashAttention-2 at long sequence lengths, with the main engineering complexity arising from intra-chunk operations and gradient computation for the additional λ terms.
Theoretical and Practical Implications
Log-linear attention bridges the gap between the efficiency of linear attention and the expressiveness of softmax attention by introducing a logarithmically growing state. The hierarchical matrix perspective connects this work to structured matrix theory, enabling efficient algorithms for both training and inference. The Fenwick tree partitioning introduces an inductive bias favoring recent context, which is well-suited for many sequence modeling tasks but may not be optimal universally.
The framework is general and can be extended to more expressive linear RNNs by allowing matrix-valued state transitions and higher-order tensor representations, though practical implementation may be more complex.
Limitations and Future Directions
Performance gap to Transformers: Despite improvements, a significant gap remains compared to full softmax attention on some benchmarks.
Engineering complexity: The implementation, especially for intra-chunk operations and backpropagation, is more involved than standard linear attention.
Inductive bias: The hierarchical compression of distant context may not be optimal for all tasks; exploring alternative partitioning schemes or more flexible hierarchical structures is a promising direction.
Parameterization of λ: Further exploration of the optimal parameterization and learning dynamics of the per-level weights could yield additional gains.
Conclusion
Log-linear attention provides a theoretically principled and practically efficient extension of linear attention, enabling scalable long-context modeling with improved expressiveness. By leveraging hierarchical matrix structures and efficient parallel algorithms, it offers a compelling alternative for sequence modeling tasks where both efficiency and long-range dependency modeling are critical. The framework is broadly applicable to existing linear attention and SSM architectures, and its hierarchical inductive bias opens avenues for further research in structured sequence modeling.