Linear Recurrent Neural Networks
- Linear recurrent neural networks are models where hidden states evolve linearly, enabling analytic tractability and efficient sequence processing.
- Variants like diagonal and block-diagonal LRNNs offer closed-form state updates and improved performance on long-range temporal tasks.
- LRNNs integrate hardware efficiency, sparsity, and attention-equivalent constructs to support robust, real-time, and scalable applications.
Linear recurrent neural networks (LRNNs) are a prominent class of sequence models in which the hidden state evolves linearly with respect to the previous hidden state and current input. Unlike traditional nonlinear RNNs that rely on gated and nonlinear activation mechanisms, LRNNs leverage the algebraic simplicity and analytic tractability of linear dynamics. Modern developments have expanded this concept with weight-space recurrences, diagonal and block-diagonal variants, sparsity-inducing schemes, parallelization strategies, and attention-equivalent constructs, establishing LRNNs as efficient, expressive, and increasingly competitive architectures for long-range temporal modeling.
1. Mathematical Formulations and Key LRNN Variants
An LRNN typically obeys a recurrence of the form $h_t = A h_{t-1} + B x_t + b$ for hidden state $h_t$, input $x_t$, recurrent weights $A$, input projection $B$, and bias $b$ (Stolzenburg et al., 2018). When $A$ is diagonal, the recurrence simplifies to elementwise updates, and the system may be interpreted as a high-order linear filter over the input stream, admitting closed-form solutions for the hidden state (François et al., 13 Feb 2025).
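The recurrence and its diagonal closed form can be sketched in a few lines of NumPy (an illustrative sketch; all shapes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 4, 3, 6                      # hidden size, input size, sequence length
lam = rng.uniform(-0.9, 0.9, size=n)   # diagonal recurrent weights (stable: |lam| < 1)
B = rng.normal(size=(n, d))            # input projection
b = rng.normal(size=n)                 # bias
x = rng.normal(size=(T, d))

# Step-by-step recurrence: h_t = diag(lam) h_{t-1} + B x_t + b
h = np.zeros(n)
for t in range(T):
    h = lam * h + B @ x[t] + b

# Closed form in the diagonal case: h_T = sum_t lam^(T-1-t) * (B x_t + b)
closed = sum(lam ** (T - 1 - t) * (B @ x[t] + b) for t in range(T))
assert np.allclose(h, closed)
```

Because the transition is diagonal, each hidden unit is an independent geometric filter over the input stream, which is what makes the closed form possible.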
Recent advances have introduced weight-space linear recurrences, notably WARP, where the hidden state is the entire flattened weight vector $\theta_t$ of a distinct "root network", itself evolving linearly: $\theta_t = A\,\theta_{t-1} + B\,x_t$. Here, $A$ and $B$ are transition matrices, $\theta_t$ is the flattened root-network weight vector, and a small neural-net decoder indexed by normalized time produces the output (Nzoyem et al., 1 Jun 2025).
Variants such as block-diagonal LRNNs admit richer transitions, $h_t = A(x_t)\,h_{t-1} + B\,x_t$, with $A(x_t)$ block-diagonal and input-dependent, enabling the simulation of finite automata for regular language tasks (Fan et al., 2023).
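An input-dependent transition can simulate a finite automaton directly: each input symbol selects a transition matrix, and the (one-hot) automaton state evolves linearly. The mod-3 counter below is an illustrative construction, not the exact parameterization of Fan et al.:

```python
import numpy as np

# Each input symbol selects a transition matrix; here a mod-3 counter
# over symbols {0, 1}, where reading a 1 advances the count.
I3 = np.eye(3)
shift = np.roll(np.eye(3), 1, axis=0)   # cyclic permutation of the 3 counter states
A = {0: I3, 1: shift}

h = np.array([1.0, 0.0, 0.0])           # one-hot automaton state (count = 0)
seq = [1, 1, 0, 1, 1]                   # four 1s -> count = 4 mod 3 = 1
for s in seq:
    h = A[s] @ h                        # linear, input-dependent state update
assert np.argmax(h) == 1
```

The permutation block here is exactly the kind of transition a fixed diagonal recurrence over the reals cannot represent.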
2. Computational Properties and Parallelization
A central computational property of LRNNs is the capacity for parallelization. The linear recurrence admits mapping to a prefix-scan (parallel scan) operator, enabling $O(\log T)$ parallel depth for sequence length $T$ (Martin et al., 2017). This approach has enabled up to 9x faster training and inference for long sequences, with the throughput advantage most pronounced in architectures with diagonal or surrogate linear recurrences.
In the diagonal setting, e.g., SSM-based LRNNs, recurrence is element-wise and perfectly suited for embarrassingly parallel GPU kernels. The "linear surrogate RNN" framework further decouples linear state evolution from nonlinear output, allowing complex architectures (e.g., GILR-LSTM) to benefit from parallel scan primitives (Martin et al., 2017).
3. Expressivity, Universal Approximation, and Theoretical Limits
Linear RNNs possess universal approximation properties: given sufficient hidden dimension $n$, any time-dependent function over a window of length $T$ can be realized exactly with linear activation, with output weights trained via linear regression (Stolzenburg et al., 2018). For continuous-time linear RNNs, arbitrary linear functionals can be approximated, with the required number of hidden units governed by the memory horizon and the error tolerance (Li et al., 2020).
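Training the readout by linear regression can be sketched as follows: a fixed random diagonal recurrence produces hidden trajectories, and ordinary least squares fits the output weights in closed form (an illustrative sketch in the spirit of Stolzenburg et al.; the target filter is chosen to lie in the span of the hidden units):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 32, 200
lam = rng.uniform(-0.95, 0.95, size=n)   # fixed random diagonal recurrence
lam[0] = 0.5                             # ensure the target filter is representable
x = rng.normal(size=T)

# Collect the hidden-state trajectory H (one row per time step)
H = np.zeros((T, n))
h = np.zeros(n)
for t in range(T):
    h = lam * h + x[t]
    H[t] = h

# Target: exponential moving average y_t = 0.5*y_{t-1} + x_t
# (exactly realizable by the hidden unit with lam = 0.5)
y = np.zeros(T)
acc = 0.0
for t in range(T):
    acc = 0.5 * acc + x[t]
    y[t] = acc

# Closed-form readout: ordinary least squares, no backpropagation
w, *_ = np.linalg.lstsq(H, y, rcond=None)
assert np.mean((H @ w - y) ** 2) < 1e-10
```

No gradient descent touches the recurrent weights; only the linear readout is fit, which is what makes the training closed-form.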
However, the curse of memory manifests: as the memory horizon grows, the number of hidden units needed for sharp recall grows sharply, subject to an irreducible time-frequency tradeoff (uncertainty principle): to recall the input from $\tau$ steps ago, the filtering "blur" cannot be made arbitrarily narrow relative to $\tau$ (François et al., 13 Feb 2025). In the worst case, exponential width is required for uniform error under long-term weightings (Li et al., 2020).
Expressivity further depends on structural design: vanilla LRNNs (with fixed or diagonal $A$) cannot simulate the non-commutative or modular transitions needed for regular-language reasoning or arithmetic, unless fully input-dependent or block-diagonal recurrences are employed (Fan et al., 2023). Recent block-diagonal input-dependent LRNNs overcome these limitations, achieving state-of-the-art length extrapolation on complex sequence tasks.
4. Training Dynamics, Adaptation, and Robustness
LRNNs are amenable to closed-form training of output parameters via linear regression or least squares, bypassing backpropagation through time (BPTT) for these components. For models in WARP's weight-space paradigm, the recurrent parameters $A$ and $B$ and the initializer $\theta_0$ are trained by backpropagation through time using either MSE or NLL losses (Nzoyem et al., 1 Jun 2025).
A distinguishing feature is test-time adaptation: certain architectures (e.g., WARP) support gradient-free online update, where the hidden state (parameterized weights of the root network) is updated by the recurrence without further gradient steps, enabling zero-shot adaptation (Nzoyem et al., 1 Jun 2025).
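The gradient-free update can be sketched as follows. Everything here is illustrative: the root-network architecture, the shapes, and the transition initialization are hypothetical stand-ins, not WARP's actual parameterization; the point is only that the weights evolve by the linear recurrence with no gradient steps at test time:

```python
import numpy as np

rng = np.random.default_rng(3)
p, d = 10, 2                 # root-network weight count, input size (illustrative)
A = np.eye(p) * 0.98 + 0.01 * rng.normal(size=(p, p))   # weight-space transition
B = 0.1 * rng.normal(size=(p, d))
theta = rng.normal(size=p)   # hidden state = flattened root-network weights

def root_net(theta, z):
    """Tiny hypothetical root network; its parameters ARE the LRNN hidden state."""
    W1 = theta[:6].reshape(3, 2)
    w2 = theta[6:9]
    b2 = theta[9]
    return np.tanh(W1 @ z) @ w2 + b2

# Test-time adaptation: theta evolves by the linear recurrence alone --
# no gradient is computed while streaming inputs.
for x in rng.normal(size=(5, d)):
    theta = A @ theta + B @ x
    y = root_net(theta, x)   # prediction from the freshly updated weights
```
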
For continuous-time stochastic LRNNs, robustness and generalization are enhanced via additive noise, leading to PAC-style generalization bounds. Convexity-like smoothing induced by noise buffers the model against mislabels and input perturbations, with empirical risk minimization yielding the best-in-class hypothesis via closed-form gradients (Bartolomaeus et al., 2021).
5. Hardware Efficiency and Sparsity
LRNNs in diagonal form are particularly suitable for hardware deployment due to constant memory and compute per token. Unstructured sparsity enables further compression: iterative magnitude pruning (IMP) with Erdős–Rényi–Kernel allocation yields networks with 90% sparsity, achieving half the compute and 36% less memory at iso-accuracy on audio denoising tasks (Pierro et al., 3 Feb 2025).
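The magnitude criterion at the core of IMP can be sketched in a few lines (a one-shot simplification; IMP proper alternates pruning with retraining, and the Erdős–Rényi–Kernel scheme allocates different sparsities per layer):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(round(sparsity * W.size))
    thresh = np.sort(np.abs(W).ravel())[k - 1] if k > 0 else -np.inf
    mask = np.abs(W) > thresh        # keep only weights above the threshold
    return W * mask, mask

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 64))
W_sparse, mask = magnitude_prune(W, 0.9)
# fraction of zeroed weights is approximately 0.9
```
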
On neuromorphic platforms, e.g., Intel Loihi 2, fixed-point quantized sparse LRNNs translate model compression into massive latency and energy gains: 42x lower latency and 149x lower energy per token compared to dense baselines on edge GPUs (Pierro et al., 3 Feb 2025).
Recent RL architectures based on trace units (RTUs) extend diagonal LRNNs for efficient real-time recurrent learning (RTRL), achieving linear cost per step, stable performance even at high hidden dimensions, and superior sample efficiency compared to gated RNNs (Elelimy et al., 2024).
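The reason diagonal linear recurrences make RTRL cheap is that the sensitivity of the state to the recurrent parameters can be carried forward elementwise, at $O(n)$ cost per step. A simplified sketch of this idea (not the RTU parameterization itself), checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 3, 10
lam = rng.uniform(0.1, 0.9, size=n)
x = rng.normal(size=(T, n))

def run(lam):
    h = np.zeros(n)
    for t in range(T):
        h = lam * h + x[t]
    return h

# RTRL: carry the sensitivity s_t = dh_t/dlam forward alongside the state.
# For a diagonal recurrence this is elementwise, i.e. O(n) per step,
# with no backward pass over the sequence.
h = np.zeros(n)
s = np.zeros(n)
for t in range(T):
    s = h + lam * s          # chain rule through h_t = lam * h_{t-1} + x_t
    h = lam * h + x[t]

# Check against central finite differences (components decouple, so a
# simultaneous perturbation of all lam entries gives per-unit derivatives)
eps = 1e-6
fd = (run(lam + eps) - run(lam - eps)) / (2 * eps)
assert np.allclose(s, fd, atol=1e-5)
```
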
6. LRNNs and Attention Mechanisms
Gated linear recurrent architectures with paired multiplicative gates can exactly reproduce linear self-attention. By appropriately setting input and output gates, LRNNs integrate key-value and query representations, matching the behavior of a causal attention accumulator (Zucchet et al., 2023). Empirical findings demonstrate that standard gradient descent reliably discovers this construction in practice, with LRNNs learning one-step gradient descent in in-context learning scenarios.
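The equivalence with causal linear (unnormalized) attention can be checked directly: accumulating outer products $v_t k_t^\top$ in a matrix-valued state and reading out with the query reproduces the attention sum token for token (a minimal sketch without the gating parameterization of Zucchet et al.):

```python
import numpy as np

rng = np.random.default_rng(6)
T, d = 5, 4
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
Q = rng.normal(size=(T, d))

# Causal unnormalized linear attention: y_t = sum_{s<=t} (q_t . k_s) v_s
attn = np.zeros((T, d))
for t in range(T):
    for s in range(t + 1):
        attn[t] += (Q[t] @ K[s]) * V[s]

# Same computation as a linear recurrence over an outer-product state S_t
S = np.zeros((d, d))
rec = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(V[t], K[t])   # linear state update (identity transition)
    rec[t] = S @ Q[t]              # readout via the query
assert np.allclose(attn, rec)
```

The recurrent form needs only $O(d^2)$ memory per step regardless of sequence length, which is the memory-efficiency argument for streaming settings.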
While linear self-attention is thus strictly subsumed, the expressivity of softmax-based attention remains unattainable without augmentation. Nonetheless, for causal and streaming applications, LRNNs provide memory-efficient sequential alternatives, with attention-style capabilities anchored in bilinear gate design and diagonal recurrences.
7. Empirical Performance and Applications
Weight-space LRNNs (WARP) match or surpass leading baselines (GRU, LSTM, S4) in sequential image completion (MNIST MSE=0.042 vs S4=0.049) and set new state-of-the-art results in time-series classification and long-range forecasting (Nzoyem et al., 1 Jun 2025).
On regular language extrapolation tasks (Sum, EvenPair, Modular Arithmetic), block-diagonal input-dependent LRNNs are the only LRNNs to generalize perfectly to arbitrarily long sequences (Fan et al., 2023).
Sparse diagonal LRNNs (S5 variants) set performance–efficiency Pareto optima for edge deployment in audio denoising (Pierro et al., 3 Feb 2025). RTUs achieve near-oracle returns and lower sample complexity in partially observable reinforcement learning (Elelimy et al., 2024). Stochastic linear RNNs achieve robust path classification on real and synthetic data, with explicit PAC bounds validating generalization properties (Bartolomaeus et al., 2021). LRNNs perform well in practical time-series forecasting (MSO, RoboCup, DAX stock); after spectral pruning, dramatic model-size reductions are possible (Stolzenburg et al., 2018).
8. Limitations and Future Directions
LRNNs confront fundamental limits in contexts demanding high nonlinearity or super-exponential memory growth. The curse of memory necessitates exponential hidden state size for sharp recall of long-context signals (Li et al., 2020, François et al., 13 Feb 2025). Pure linear recurrence is insufficient for complex reasoning unless combinatorial or input-dependent block structures are adopted (Fan et al., 2023). Vanishing gradients persist for long sequence lengths, necessitating architectural mitigation.
Potential extensions include low-rank, diagonal-plus-low-rank, or structured recurrences, kernel-based parameterizations decoupling explicit transitions, enhanced root network decoders in weight-space RNNs, and theoretical investigation into generalization and stability in high-dimensional weight trajectories (Nzoyem et al., 1 Jun 2025). Hardware-aware design leveraging sparsity, quantization, and neuromorphic adaptability holds promise for real-time and embedded applications (Pierro et al., 3 Feb 2025).
LRNNs are poised to integrate further with attention layers, nonlinear surrogates, and block-diagonal mechanisms, establishing themselves as a core analytic and engineering primitive for deep sequence modeling, continual learning, and physics-informed forecasting.