HyperMLP: Dynamic MLP-Based Sequence Modeling

Updated 16 February 2026

The paper introduces HyperMLP, a framework that reformulates autoregressive attention into a two-layer MLP with dynamic, context-conditioned weights derived from sequence history.
It employs a novel lag layout and a gated linear unit variant (HyperGLU) to ensure autoregressive consistency and effective temporal mixing.
Empirical results demonstrate that HyperMLP/HyperGLU outperform softmax and ReLU attentions, reducing next-token loss by approximately 0.06 under matched parameter budgets.

The HyperMLP framework is a sequence modeling architecture that reconceptualizes autoregressive attention as a dynamic two-layer multilayer perceptron (MLP), wherein the weights are context-dependent and instantiated directly from the sequence history. HyperMLP and its Gated Linear Unit (GLU) variant, HyperGLU, unify feature-space and sequence-space mixing through dynamic parametrization, and introduce a lag layout to ensure autoregressive (AR) consistency. This yields a strict generalization of standard softmax and ReLU attention, with demonstrable empirical improvements under matched parameter budgets across controlled memory and language modeling benchmarks (Lu et al., 13 Feb 2026).

1. Architectural Reformulation: Attention as Dynamic MLP

In canonical autoregressive Transformer attention, the output for a token at timestep $t$ is computed by softmax-weighted query-key lookups over the context prefix: $q_t = x_t W^q, \quad K = X_{<t} W^k, \quad V = X_{<t} W^v, \quad O_t = \mathrm{softmax}(q_t K^\top)V W^o,$ where $X_{<t} \in \mathbb{R}^{(t-1) \times d}$ is the prefix of $d$ -dimensional embeddings.

HyperMLP observes that this operation is algebraically equivalent to a two-layer MLP with context-dependent weights: $W_\mathrm{MLP}^{(1)}(X_{<t}) = L^{(1)}(x_t)(X_{<t})^\top R^{(1)}(x_t) \in \mathbb{R}^{d \times (t-1)},$

$W_\mathrm{MLP}^{(2)}(X_{<t}) = R^{(2)\top}(x_t)(X_{<t})^\top L^{(2)\top}(x_t) \in \mathbb{R}^{(t-1) \times d},$

with the hidden and output layers given by

$h_t = x_t W_\mathrm{MLP}^{(1)}(X_{<t}) \in \mathbb{R}^{1 \times (t-1)}, \quad O_t = o(h_t) W_\mathrm{MLP}^{(2)}(X_{<t}) \in \mathbb{R}^{1 \times d},$

where $o(\cdot)$ is a pointwise nonlinearity, such as ReLU or GLU.

2. Context-Conditioned Selection and the Activation Structure

HyperMLP replaces the traditional probabilistic normalization (softmax) with ReLU-like gating combined with L₂ normalization: $o(h) = \mathrm{ReLU}\bigl(\mathrm{L2Norm}_{t-1}(h)\bigr), \quad \mathrm{L2Norm}_{t-1}(h) = \frac{h}{\|h\|_2 + \epsilon}.$ The gates select a subset of $t-1$ context slots based on positive pre-activations, implementing input-conditioned selection from the context-dependent memory pool, rather than normalizing to a probability simplex as in softmax attention. The GLU-based variant, HyperGLU, further splits the gates and scaling branches: $h_\mathrm{gate} = x W_\mathrm{gate}^{(1)}(X_{<t}), \quad h_\mathrm{scale} = x W_\mathrm{scale}^{(1)}(X_{<t}),$

$a_t = \mathrm{Softplus}(h_\mathrm{scale}) \odot \mathrm{ReLU}(\mathrm{L2Norm}(h_\mathrm{gate})), \quad O_t = a_t W^{(2)}(X_{<t}).$

The result is decoupled gating (active set) and magnitude modulation (slot strength), with the active-set determined by $\{h_\mathrm{gate} > 0\}$ .

3. Lag Layout, Autoregressive Consistency, and Sequence Mixing

HyperMLP introduces a lag (reverse-offset) layout, storing context prefixes as $X_{t:1} = [x_t; x_{t-1}; \ldots; x_1]$ , aligning temporal mixing with autoregressive semantics. Sequence mixing operators $R^{(m)}(x)$ are parameterized so that their extension to larger context blocks appends rows corresponding to older tokens without affecting the next-token computation,

$O_T(x, X_{T:1}) = O_t(x, X_{t:1}),$

given that the L₂ normalization is padding-invariant. This guarantees proper AR consistency: readouts are unaffected by far-past context extensions. Empirical ablations confirm that without lag layout, performance dramatically degrades (Lu et al., 13 Feb 2026).

Mixing in both feature and sequence spaces is accomplished using dynamically parametrized low-rank (DPLR) operators, applied symmetrically both pre- and post-activation for maximum benefit. Two-sided temporal mixing outperforms one-sided mixing.

4. Theoretical Expressivity and Parameter Budget Considerations

HyperMLP offers strict expressivity generalization over standard attention. In residual two-layer blocks $f(x) = x + (xW_1)W_2$ , control over the update and conditioning subspaces depends on whether $W_1$ or $W_2$ is rank-compressed. To preserve update expressivity under a fixed budget, it is favorable to shrink the QK-path (first layer) and pay for additional mixing capability, rather than compress the VO-path (second layer).

Token-wise ReLU attention implements polyhedral gating boundaries, while the sequence-dependent mixing of HyperMLP warps these into curved hypersurfaces, thereby enlarging the function class (Theorem C.13, Prop. C.14). HyperMLP can realize gating boundaries unattainable by static feature projections even when parameter budgets are matched. This theoretical advantage is formalized in Proposition I.2 of the source (Lu et al., 13 Feb 2026).

5. Empirical Performance and Comparative Analysis

HyperMLP and HyperGLU outperform strong softmax- and ReLU-attention baselines under matched parameterizations on the Memory-Augmented Dataset (MAD) suite and large-scale language modeling (NanoGPT and Open LLM Leaderboard). In controlled studies:

ReLU attention with enhanced feature-side projections matches softmax attention performance.
Two-sided DPLR mixing with lag layout substantially increases MAD average scores (e.g., ≈66 → 81).
Under strict budget constraints (e.g., 340 M and 1.3 B parameter Transformers, context length 4096), HyperMLP/HyperGLU reduce next-token loss by approximately 0.06 compared to ReLU-attn, and achieve superior leaderboard rankings.
Ablation studies show lag layout is essential for AR consistency; two-sided mixing and QK compression for mixing are consistently optimal.

A summary table in (Lu et al., 13 Feb 2026) consolidates these comparisons across ≈50 model variants on both MAD and NanoGPT.

6. Implementation and Practical Guidelines

Key implementation aspects:

HyperMLP/GLU heads use $n_\text{head}=2$ to cap DPLR overhead.
Head ranks: $d_{QK} = d/(8n_\text{head})$ , $d_{VO} = d/n_\text{head}$ .
Sequence mixing rank per operator: $r_s = 16$ ; per-head state/time complexity: $O(t r_s)$ .
L₂ norm stabilization parameter: $\epsilon = 10^{-12}$ .
Optionally apply a kernel-4 convolution in QK for lightweight local mixing.
Training utilizes chunked row-blocking (chunk ≤ 512) and offset/skew layouts to maintain contiguous causal prefixes; DPLR contraction with fused epilogue mitigates memory traffic.
At inference, only the last $r_s$ vectors per head are maintained ( $O(r_s)$ state), with per-step complexity $O(t d_{QK} + t d_{VO} + t r_s)$ (quadratic scaling for $r_s \ll d_{QK}, d_{VO}$ ).

7. Summary and Research Impact

The HyperMLP framework fundamentally reinterprets attention as dynamic two-layer MLP computation over an ever-growing context, enabling global slot selection and mixing through DPLR operators and lag-aligned temporal semantics. By realigning parameter budget allocation and enhancing the expressivity via non-static mixing operations, HyperMLP/HyperGLU generalize standard attention to richer gating geometries. Under identical resource constraints, these architectures demonstrate consistent empirical improvements on synthetic memory and real-world language modeling, representing a significant reformulation of sequence modeling practice (Lu et al., 13 Feb 2026).

Markdown Report Issue Upgrade to Chat

References (1)

HyperMLP: An Integrated Perspective for Sequence Modeling (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to HyperMLP Framework.