HyperMLP: An Integrated Perspective for Sequence Modeling
Abstract: Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
Explain it Like I'm 14
What is this paper about?
This paper suggests a simpler way to think about how Transformers process sequences like text. Instead of seeing “attention” as choosing past words using probabilities, the authors show it can be viewed as a tiny two-layer calculator (an MLP) that changes its settings on the fly based on the words you’ve already seen. Using this idea, they build new attention-like blocks called HyperMLP and HyperGLU that make models smarter and more efficient under the same size limits.
Think of it like this: when writing a story word by word, the model keeps a growing “notebook” of past words. Traditional attention treats this notebook like a list it picks from using probabilities. HyperMLP/HyperGLU treat it like a toolbox of switches and sliders that the model learns to rearrange and use depending on the current word and the whole notebook.
What questions are they trying to answer?
The paper asks:
- Can we replace probability-based attention (softmax) with a simpler, more flexible “gating” style and still do as well, or even better?
- Can we make attention more expressive by learning how to mix information across the sequence itself, not just across features?
- How do design choices in attention (like where to place gates or how much to compress ranks) actually affect a model’s abilities?
- Do these new blocks beat strong baselines in practice when the number of parameters is kept the same?
How did they approach the problem?
The authors reframe “self-attention” as a dynamic two-layer MLP:
- An MLP (multi-layer perceptron) is a basic neural network with layers that apply linear transforms and simple activations like ReLU. ReLU is like a switch: it keeps positive numbers and zeroes out negatives.
- In standard attention, the model computes “scores” between the current word and each past word, then uses softmax to turn those scores into probabilities and averages the past information.
- In their view, those scores are not probabilities; they are the hidden layer of a two-layer MLP whose “weights” are created from the context (the history of words). The model uses gates (ReLU or GLU) to decide which parts of the context to activate, like turning on some switches based on the input.
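This reframing can be checked numerically. Below is a minimal sketch (our illustration, not the paper's code), with arbitrary sizes, illustrative projection names `Wk`/`Wv`, and the query projection omitted for brevity: reading a causal attention head at the last step is literally a two-layer MLP whose weights are instantiated from the context, with keys as the first layer, values as the second, and softmax playing the role of the activation.

```python
import numpy as np

# Context-instantiated "weights": K is the first-layer matrix, V.T the second.
rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.standard_normal((T, d))      # context tokens x_1..x_T
Wk = rng.standard_normal((d, d))     # illustrative key projection
Wv = rng.standard_normal((d, d))     # illustrative value projection

K, V = X @ Wk.T, X @ Wv.T
q = X[-1]                            # current token (query projection omitted)

# View 1: standard softmax attention at the last step.
scores = K @ q / np.sqrt(d)
p = np.exp(scores - scores.max())
p = p / p.sum()
attn_out = p @ V                     # probability-weighted read of the values

# View 2: the same computation as a two-layer MLP with hidden layer
# h = activation(W1 @ q) and output W2 @ h, where W1 = K/sqrt(d),
# W2 = V.T, and the "activation" happens to be softmax.
z = K @ q / np.sqrt(d)
h = np.exp(z - z.max())
h = h / h.sum()
mlp_out = V.T @ h

assert np.allclose(attn_out, mlp_out)
```

The paper's point is that once attention is written this way, the softmax "activation" is just one choice among many, and gates like ReLU or GLU become natural drop-in alternatives.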
They then design two new blocks:
- HyperMLP:
- Learns to mix information across both feature space (which describes what a word is like) and sequence space (which describes where it is in the history), using efficient low-rank factors. “Low-rank” means they keep only the most important directions, like summarizing a big photo into a small but useful version.
- Uses a “lag” layout: it treats the history in reverse order (newest to oldest). This keeps behavior consistent as the sequence grows, which matches how models generate text one token at a time.
- Normalizes with simple L2 scaling (like adjusting volume so the signal doesn’t blow up) rather than probability normalization.
- HyperGLU:
- Similar to HyperMLP, but replaces the ReLU with a GLU, which splits the hidden signal into two parts: one decides “which slots to use” (selection), and the other controls “how strongly to use them” (scale). Imagine one knob choosing the tools and a second knob setting their power.
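The three hidden-layer treatments can be contrasted on a single vector of scores. All numbers and the names `sel`/`mag` below are made up for illustration:

```python
import numpy as np

scores = np.array([1.2, -0.7, 0.3, -2.0, 0.5, -0.1])

# Softmax: a probability distribution -- every slot gets nonzero weight.
p = np.exp(scores) / np.exp(scores).sum()

# HyperMLP-style: ReLU switches off negative slots, then L2 rescaling
# controls the overall "volume" instead of forcing weights to sum to 1.
h = np.maximum(scores, 0.0)
h = h / (np.linalg.norm(h) + 1e-6)

# HyperGLU-style: one half of the hidden signal selects slots (ReLU),
# the other half independently sets their magnitude (and can flip sign).
sel = np.array([0.5, -1.0, 2.0, -0.2, 0.0, 1.5])
mag = np.array([2.0, 3.0, -0.5, 1.0, 4.0, 0.1])
g = np.maximum(sel, 0.0) * mag

assert np.isclose(p.sum(), 1.0) and np.all(p > 0)  # softmax: all slots active
assert np.count_nonzero(h) == 3                    # ReLU: hard selection
assert g[1] == 0.0 and g[3] == 0.0                 # GLU: unselected slots stay off
```

The key qualitative difference: softmax must spread some weight everywhere, while the gated variants make a hard selection and handle magnitude separately.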
Everyday analogies for technical terms:
- Autoregressive: writing text one word at a time, only seeing what you already wrote.
- Attention scores: a set of numbers measuring relevance; here treated as switch settings, not probabilities.
- Sequence mixing: like rearranging and blending your notebook’s pages to create new helpful summaries before reading from them.
- Low-rank and diagonal-plus-low-rank: efficient ways to mix information without handling huge matrices, like using a small set of levers rather than a giant control panel.
- KV cache: a memory of past processed features the model can quickly access.
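The “small set of levers” idea behind diagonal-plus-low-rank (DPLR) mixing can be sketched in a few lines. The sizes and rank below are arbitrary choices for exposition:

```python
import numpy as np

# A T x T sequence mixer stored as a diagonal plus a rank-r update,
# applied in O(T*r) work instead of O(T^2).
rng = np.random.default_rng(0)
T, r = 8, 2
d_diag = rng.standard_normal(T)      # the diagonal "levers"
U = rng.standard_normal((T, r))      # low-rank factors
W = rng.standard_normal((T, r))

x = rng.standard_normal(T)           # one channel of the sequence

# Apply M = diag(d_diag) + U @ W.T without ever materializing M:
y_fast = d_diag * x + U @ (W.T @ x)

# Dense reference (the "giant control panel") agrees:
M = np.diag(d_diag) + U @ W.T
assert np.allclose(M @ x, y_fast)
```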
They test their ideas with:
- Controlled design studies on a diagnostic benchmark (MAD) and a small GPT-style trainer (NanoGPT), carefully matching parameter budgets to isolate effects.
- Full language-model training at two scales (340M and 1.3B parameters) on large datasets, evaluated on many public benchmarks.
What did they find, and why does it matter?
Key findings:
- Softmax probabilities aren’t essential. Replacing softmax with ReLU-style gates often matches or beats softmax under the same budget when paired with good feature handling.
- Mixing along the sequence dimension is a big win. Learning how to combine information across past positions gives strong improvements in tasks that need retrieval and copying, like finding and reusing earlier content.
- The “lag” layout (reverse order) is necessary. Without it, temporal mixing breaks the autoregressive logic and performance crashes. With it, results are much better and consistent as the sequence length changes.
- Mixing on both sides helps. Applying sequence mixing on the “routing” side (which slots are active) and the “readout” side (how to combine them) is better than mixing on just one.
- Budget matters: keep more rank on the readout side. Compressing the readout side hurts update diversity more than compressing the routing side. This explains why some popular fine-tuning tricks work better on certain projections.
- HyperGLU tends to be the best variant. Splitting selection from magnitude makes models more robust and leads to the strongest overall results in their tests.
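Why the lag layout matters can be shown with a toy fixed kernel over the last three positions (our construction, not the paper's mixer): indexing by lag keeps the current read stable as the history grows, while absolute (front-aligned) indexing does not.

```python
w = [0.6, 0.3, 0.1]                      # weight on lag 0 (newest), 1, 2

def mix_lag(history):
    recent = history[::-1][:3]           # reverse-offset: newest first
    return sum(wi * xi for wi, xi in zip(w, recent))

def mix_absolute(history):
    return sum(wi * xi for wi, xi in zip(w, history))  # front-aligned

short = [1.0, 2.0, 3.0]
longer = [9.0, 9.0, 1.0, 2.0, 3.0]       # same recent past, older prefix added

# Lag layout: extending the far past does not change the current read.
assert mix_lag(short) == mix_lag(longer)
# Absolute indexing: the same extension shifts everything and breaks it.
assert mix_absolute(short) != mix_absolute(longer)
```

This is the truncation-invariance property the paper relies on: with the lag layout, appending older context leaves the newest-step computation untouched, matching autoregressive semantics.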
In numbers (high-level takeaways):
- On diagnostic tasks (MAD), adding sequence mixing with lag layout boosts performance a lot, especially on retrieval/copy tasks.
- In NanoGPT training, models with HyperMLP/HyperGLU reach lower losses faster under the same training setup.
- On broader language-model evaluations, HyperMLP and especially HyperGLU consistently rank higher than strong softmax-attention and other modern sequence models at matched sizes.
What does this mean going forward?
Implications:
- Rethinking attention as a dynamic two-layer MLP simplifies design and explains many empirical tips the community discovered by trial and error. It suggests focusing on learned gating and sequence mixing rather than strictly preserving probability distributions.
- Under the same parameter and compute budget, you can get more capability by learning how to rearrange and gate the sequence memory. This is promising for building smaller, stronger, and more efficient models.
- Engineering challenges remain. HyperMLP/HyperGLU introduce new sequence mixing steps that don’t yet plug into highly optimized kernels like FlashAttention, so matching raw speed needs further work.
- Scaling up is the next step. The paper shows consistent gains at moderate sizes; testing at very large “frontier” scales could confirm the value of this approach across domains like language, vision, and speech.
In short: the paper turns attention into a flexible, learnable calculator that selects and mixes context in smarter ways. That shift delivers better performance without increasing model size, opening a practical path to more capable and efficient sequence models.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide follow‑up research.
- Scaling to frontier sizes: Validate HyperMLP/HyperGLU at tens to hundreds of billions of parameters and ≥1T tokens, including convergence behavior, training stability, memory footprint, and whether gains persist or change with scale.
- Kernel-level optimization: Design and benchmark fused GPU kernels for sequence mixing (e.g., FlashAttention-like implementations for DPLR mixers), quantify wall-clock speed, throughput, and memory vs. highly optimized attention.
- Long-context performance: Empirically assess extension-consistency claims beyond 4K tokens (e.g., 32K–128K). Test on long-context benchmarks (LRA variants, long-document QA, needle-in-a-haystack, retrieval-heavy tasks) and report degradation curves.
- Sequence-mixing rank selection: Systematically study how the DPLR rank r_s affects expressivity, training dynamics, and compute across layers/heads; explore adaptive or data-driven rank schedules and per-layer heterogeneity.
- Number of heads: The paper fixes nhead = 2 for efficiency. Characterize performance, stability, and budget trade-offs across typical head counts (e.g., 8–32), including head specialization/diversity and interactions with sequence mixing.
- Activation and gating choices: Beyond ReLU/GLU, evaluate SiLU/GELU/leaky-ReLU and alternative GLU variants; isolate their effects on routing geometry, optimization, and generalization under identical budgets.
- Normalization strategy: L2Norm over the hidden vector is assumed as a scalar rescaling. Compare RMSNorm/LayerNorm/affine-free variants applied pre/post mixing; analyze training stability, gradient conditioning, and empirical routing invariance to normalization changes.
- Optimization dynamics: Provide a deeper analysis of gradient flow, gate saturation, margin dynamics, and curvature induced by x_t-dependent mixers; identify failure modes (e.g., collapse without lag layout) and mitigation strategies.
- KV-cache integration: Precisely characterize incremental decoding with DPLR mixing—cache states, update rules, memory growth, and latency. Benchmark streaming inference (batch size, context growth) and caching strategies across long sequences.
- Modal transferability: Test HyperMLP/HyperGLU on vision (ViT), speech, reinforcement learning (Decision Transformer), and multi-modal tasks to verify the claimed unified perspective beyond language.
- Benchmark breadth and robustness: Expand evaluation to coding (HumanEval/MBPP), math reasoning (GSM8K, ASDiv), instruction-following (AlpacaEval, MT-Bench), multilingual tasks, and out-of-distribution robustness; report statistical significance across seeds.
- Energy and compute profiling: Quantify energy usage, FLOPs, memory bandwidth, and training/inference cost per token compared to softmax and other selective state-space models; assess carbon footprint and efficiency trade-offs.
- Mechanistic interpretability: Operationalize the pool/activated-memory view—extract active slot sets, measure routing margins, visualize curved decision boundaries, and relate them to known transformer circuits (induction heads, copy mechanisms).
- Formal expressivity guarantees: Provide universality or approximation theorems for dynamic two-layer MLPs with bounded feature and sequence ranks; bound function classes vs. softmax attention and linear attention under matched budgets.
- Registers in practice: The paper argues registers enlarge the hidden pool; empirically test register count/placement and quantify gains vs. parameter cost, including interactions with DPLR mixing.
- Budget asymmetry under fine-tuning: Extend QK vs. VO compression insights to parameter-efficient tuning (LoRA, adapters) and task-specific specialization; establish guidelines for where to place low-rank adapters and gates in HyperMLP/GLU.
- Alternative sequence mixing designs: Compare DPLR with Toeplitz/conv kernels, low-discrepancy permutations, orthogonal cores (e.g., Lie-group parameterizations), and learned pooling/subsampling; analyze efficiency–expressivity trade-offs.
- Positional semantics: Evaluate interactions with RoPE/relative position encodings and alternatives (ALiBi, T5-style). Determine when learned sequence mixing can replace or complement explicit positional encodings.
- Subquadratic variants: Investigate whether sequence mixing can be structured to enable subquadratic training/inference (e.g., sparse/DPLR+convolution hybrids, low-rank plus local windows) while preserving HyperMLP expressivity.
- Quantization and pruning: Study compatibility with 8-bit/4-bit training/inference and structured sparsity; measure how gates and sequence mixing affect quantization error and pruning sensitivity.
- Transfer and continual learning: Test how context-instantiated weights adapt under domain shift and incremental data; analyze catastrophic forgetting and whether sequence mixing provides more robust in-context adaptation.
- Layerwise design: Explore how mixing ranks, activations, and normalization should vary by depth; identify whether early vs. late layers benefit differently from sequence mixing and GLU routing.
- Assumption sensitivity: Several proofs rely on scalar L2 normalization and padding invariance; probe empirical sensitivity to deviations (e.g., non-scalar norms, variable-length padding) and identify conditions needed for lag-layout invariance.
- Data boundaries and segmentation: Examine how lag layout and sequence mixing behave across document boundaries, resets, and concatenated contexts; propose methods for boundary-aware mixing or slot masking.
- Multi-head coordination: Investigate whether heads learn complementary sequence bases; develop diagnostics and regularizers for head diversity and cross-head routing interference.
Practical Applications
Immediate Applications
Below are concrete, deployable uses that leverage HyperMLP/HyperGLU’s dynamic two‑layer MLP view of attention, learned sequence mixing (DPLR), and GLU/ReLU routing under matched parameter budgets.
- Drop‑in replacement heads for small–mid‑scale Transformers to boost retrieval/copy behaviors
- Sectors: software, education, content generation
- What: Replace softmax attention with HyperGLU/HyperMLP in autoregressive decoders for better in‑context recall, selective copy, and long‑range retrieval (as shown by MAD and LM benchmarks).
- Tools/workflows: PyTorch layer modules for “HyperGLUAttention,” config flag in training scripts; evaluation with MAD and Open LLM Leaderboard suites.
- Assumptions/dependencies: Current kernels are not as optimized as FlashAttention; best suited where slight latency overhead is acceptable; recommended nhead≈2 and lag (reverse‑offset) layout.
- Fine‑tuning recipes that reallocate rank and adapters per theory
- Sectors: software, enterprise AI
- What: Apply “budget asymmetry” guidance—compress QK, preserve VO rank; attach LoRA/gates on VO side for parameter‑efficient adaptation; expect more update directions per parameter.
- Tools/workflows: PEFT plug‑ins (“VO‑LoRA”, VO‑side gating), training checklists codifying Proposition 2.6 and Theorem 2.5.
- Assumptions/dependencies: Autoregressive blocks; stable with L2 normalization of scores; monitor update‑subspace coverage.
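The rank intuition behind this recipe can be sketched: a LoRA-style update of rank r can only move a layer's outputs within an r-dimensional subspace, which is why preserving rank on the readout (V/O) side matters. Shapes and names below are illustrative, not the paper's code:

```python
import numpy as np

# LoRA-style update delta_W = B @ A with rank r on a d x d projection.
rng = np.random.default_rng(0)
d, r = 16, 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                      # parameter-efficient update on V/O

# However inputs vary, the induced output changes live in the column
# space of B: at most r independent "update directions".
assert delta_W.shape == (d, d)
assert np.linalg.matrix_rank(delta_W) <= r
```

Per the paper's budget-asymmetry argument, compressing this rank on the readout side shrinks the update subspace directly, whereas compressing QK is comparatively forgiving.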
- Retrieval‑heavy assistants and code completion with stronger internal copy
- Sectors: software development, productivity
- What: Improve selective copy and fuzzy recall in IDE copilots, document assistants, meeting summarizers without changing RAG backends.
- Tools/workflows: Swap decoder attention blocks; unit tests focused on copy/recall; prompt templates that exploit longer prefixes.
- Assumptions/dependencies: Throughput may dip without fused kernels; evaluate under your latency SLOs.
- Time‑series forecasting and anomaly detection with lag‑consistent mixing
- Sectors: retail (demand), AIOps (logs), energy (load), finance (signals)
- What: Use HyperMLP’s lag layout and sequence mixing to better combine nearby lags and long‑range patterns in AR forecasters.
- Tools/workflows: Torch forecasting stacks (e.g., PyTorch Forecasting) with HyperGLU decoders; backtesting pipelines.
- Assumptions/dependencies: AR formulation; choose DPLR rank r_s and convolution options; validate truncation invariance under sliding windows.
- Speech and translation decoders with improved AR routing
- Sectors: speech, localization
- What: Replace decoder self‑attention to sharpen long‑context alignment and robustness in streaming ASR and NMT.
- Tools/workflows: Integrate in ESPnet/Fairseq decoders; measure WER/BLEU at matched parameters.
- Assumptions/dependencies: Kernel efficiency may limit streaming latency on edge; start in server/batch or research settings.
- Decision Transformer pipelines with better long‑range credit assignment
- Sectors: robotics, reinforcement learning
- What: Use HyperGLU decoders for sequence‑modeled returns/actions to improve retrieval of distant returns and sub‑trajectory patterns.
- Tools/workflows: Gym/DMControl experiments; ablations on GLU vs ReLU routing.
- Assumptions/dependencies: Sequential policy inference; offline RL datasets with long horizons.
- Interpretability and eval workflows that avoid “attention as probability” pitfalls
- Sectors: safety, compliance, academia
- What: Replace attention‑map explanations with “activated slot tracing” (trace the positive coordinates of h_t), consistent with dynamic MLP routing.
- Tools/workflows: Logging hooks that record active slots and pool atoms; MAD‑style diagnostics for selection/copy.
- Assumptions/dependencies: Reframe governance docs—attention weights are gates, not calibrated probabilities.
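A minimal version of such a logging hook, assuming the gated hidden vector h_t is available at each decoding step (the function name and threshold are our own choices, not the paper's API):

```python
# "Activated slot tracing": record which hidden coordinates the gate
# turned on at this step, rather than rendering scores as a probability
# heatmap.
def active_slots(h_t, eps=0.0):
    """Indices of slots with positive gate value at this decoding step."""
    return [i for i, v in enumerate(h_t) if v > eps]

h_t = [0.8, 0.0, -0.3, 1.2, 0.0, 0.05]  # gated hidden vector at one step
assert active_slots(h_t) == [0, 3, 5]
```

Logging these index sets over time yields routing traces that respect the gate semantics: a slot is either recruited or not, and its weight is a gain, not a calibrated probability.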
- Curriculum and methodology for teaching/analysis of attention as dynamic MLP
- Sectors: academia, training
- What: Use the three‑stage memory view, lag layout, and budget asymmetry to teach/diagnose Transformer behavior.
- Tools/workflows: Lab assignments reproducing MAD gains; visualization of warped routing vs polyhedral partitions.
- Assumptions/dependencies: Autoregressive context; small compute sufficient for didactic demos.
- Batch/offline inference where quality outweighs latency
- Sectors: data labeling, offline analytics
- What: Run HyperGLU models for offline summarization, labeling, or data augmentation to maximize quality per parameter.
- Tools/workflows: Batch generation pipelines; quality checkpoints at equal token budgets.
- Assumptions/dependencies: No real‑time constraints; throughput costs acceptable pending kernel optimizations.
Long‑Term Applications
These require further research, scaling, systems work, or ecosystem adoption before wide deployment.
- Fused kernels and compiler support for DPLR sequence mixing
- Sectors: software infrastructure, semiconductors
- What: FlashAttention‑grade kernels (“FlashHyper”) that fuse L2Norm, GLU gating, and DPLR mixing; scheduler support in Triton/XLA.
- Tools/products: CUDA/Triton kernels; ONNX opset extensions; TVM autotuning.
- Assumptions/dependencies: Cross‑vendor GPU/NPU support; numerical stability and memory‑bandwidth tuning.
- Frontier‑scale LLM pretraining with HyperGLU
- Sectors: cloud AI, foundation models
- What: Train multi‑billion parameter HyperGLU models on 0.5–2T tokens to validate scaling, long‑context ICL, and benchmark leadership.
- Tools/workflows: Megatron‑LM/DeepSpeed integrations; context length ≥32k; long‑context evals.
- Assumptions/dependencies: Significant compute; mixture‑of‑experts compatibility; kernel maturity to maintain throughput.
- On‑device real‑time assistants via capability‑per‑parameter gains
- Sectors: mobile, embedded, IoT
- What: Combine smaller HyperGLU models with mobile NPUs to deliver higher quality at the same memory footprint.
- Tools/products: Android NNAPI/Core ML delegate for HyperGLU ops; quantization/weight sharing tuned for GLU routes.
- Assumptions/dependencies: Hardware kernels for DPLR; energy‑aware scheduling; further latency reductions.
- Multimodal HyperGLU for vision, audio, and video
- Sectors: vision (ViT/segmentation), media understanding
- What: Apply learned sequence mixing across patches/frames to improve retrieval of distant frames and context‑wide slots.
- Tools/workflows: ViT backbones with HyperGLU token mixers; audiovisual transformers in ASR/video QA.
- Assumptions/dependencies: Adapt lag semantics to bidirectional or relative layouts; new positional schemes.
- Memory‑augmented and long‑context systems without heavy external RAG
- Sectors: enterprise search, legal, scientific analysis
- What: Use “context‑wide slot” mixing to internalize retrieval within long prompts, reducing dependence on external indices.
- Tools/workflows: Chunking strategies that align with lag layout; dynamic slot introspection for cache management.
- Assumptions/dependencies: Efficient long‑context kernels; memory‑safe caching; evaluation on >128k tokens.
- Robotics/control stacks with online AR consistency
- Sectors: robotics, industrial automation
- What: Controllers using HyperGLU for better sequence routing over lagged sensor histories; improved stability with extension consistency.
- Tools/workflows: ROS2 modules; sim‑to‑real pipelines with Decision Transformer variants.
- Assumptions/dependencies: Deterministic low‑latency kernels; safety/verification of warped routing.
- Healthcare sequential modeling (EHR trajectories, ICU)
- Sectors: healthcare
- What: Predict outcomes by mixing temporal events (labs, meds) with learned lag‑aware slots; better utilization of long histories.
- Tools/workflows: De‑identified EHR pipelines; conformal risk controls; drift monitors.
- Assumptions/dependencies: Regulatory compliance; interpretability tooling tailored to activated‑slot view; domain validation.
- Financial forecasting and execution
- Sectors: finance
- What: Apply lag‑consistent temporal mixing to price/volume microstructure and event sequences for improved signal extraction.
- Tools/workflows: Backtesting with walk‑forward validation; risk controls on non‑stationarity.
- Assumptions/dependencies: Strict latency budgets in HFT require kernels; robust regularization to avoid overfitting.
- Distillation and compression strategies exploiting routing/readout asymmetry
- Sectors: model compression, edge AI
- What: Distill softmax‑attention teachers into HyperGLU students that keep VO rank while compressing QK; new pruning that preserves readout subspace.
- Tools/workflows: Head‑wise rank scheduling; subspace‑aware pruning; KD losses targeting active‑slot sets.
- Assumptions/dependencies: Teacher‑student infrastructure; careful stability tuning of GLU gates.
- Standards and policy guidance on explainability and energy
- Sectors: policy, governance
- What: Update explainability frameworks to avoid treating attention scores as probabilities; promote evaluation based on active‑slot routing; encourage capability‑per‑token metrics for greener AI.
- Tools/workflows: Audit checklists; reporting formats for routing diagnostics; procurement guidance prioritizing smaller, more capable models.
- Assumptions/dependencies: Community consensus; regulator education; availability of standardized routing metrics.
Notes on feasibility and adoption
- Best near‑term wins appear in research and smaller production systems where a modest throughput penalty is acceptable in exchange for quality gains per parameter.
- Broad deployment hinges on systems work: fused kernels, memory‑efficient caching for DPLR mixing, and hardware support.
- The approach is naturally aligned with autoregressive settings; bidirectional encoders may require adapted layouts and training objectives.
Glossary
- Activated set (routing): The subset of hidden coordinates selected by activation based on score signs during dynamic gating. "Routing is the active-set partition induced by the sign pattern of h_t."
- Autoregressive attention head: A single attention mechanism applied causally to past tokens, here reframed as a dynamic two-layer MLP. "an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history."
- Autoregressive generation (AR): Sequence modeling where each output depends on previous outputs; training/inference respects causal order. "In autoregressive (AR) generation, attention enables parallel training and efficient inference:"
- Autoregressive truncation invariance: The property that extending the far past does not change current outputs under a lag layout. "Lag layout: extension consistency implies AR truncation invariance"
- Block-diagonal rotation (RoPE core): A structured feature-space transformation applying independent rotations per sequence channel. "inserting a block-diagonal (per sequence channel of X) rotation,"
- Budget asymmetry (QK vs. VO): Unequal impact of allocating rank/parameters between first-layer (QK) and second-layer (VO) components. "Budget asymmetry in residual two-layer blocks"
- Content-addressable lookup: Attention interpretation where queries retrieve values by matching against keys and weighting results. "a content-addressable lookup in which a query matches against keys, softmax converts the resulting scores into a distribution over positions, and the output is an expectation-style read"
- Depthwise convolution (KV-side): Channel-wise convolution applied to key/value projections to mix local temporal information. "we still use depthwise convolution in the default HyperMLP/HyperGLU,"
- Diagonal-plus-low-rank (DPLR): A matrix parameterization combining a learned diagonal with a low-rank update for efficient sequence mixing. "Both are input-conditioned with low-rank or diagonal-plus-low-rank (DPLR) forms:"
- Dynamic two-layer MLP: A two-layer perceptron whose effective weights depend on the current context, here aligning with attention. "(i) Dynamic two-layer MLP."
- Fast-weight programming: Approaches that generate weights on the fly conditioned on recent inputs, related to dynamic maps. "classic fast-weight programming (Schmidhuber, 1992; Schlag et al., 2021)"
- FlashAttention: An IO-aware, fused attention kernel for speed and memory efficiency; not directly reusable here. "cannot directly reuse existing efficient backends such as FlashAttention (Dao et al., 2022)."
- GEMM operations: General matrix-matrix multiplications used for efficient batched linear algebra in implementations. "leverages efficient GEMM operations after compilation."
- GLU (Gated Linear Unit): An activation that gates and modulates magnitudes, separating selection from scaling. "HyperGLU replaces the ReLU hidden activation by a GLU-style modulation."
- HyperGLU: The proposed attention variant using GLU-style routing with learned sequence/feature mixing. "HyperGLU replaces the ReLU hidden activation by a GLU-style modulation."
- HyperMLP: The proposed attention-as-MLP architecture with learned input-conditioned sequence/feature mixing. "we propose HyperMLP, which learns input-conditioned mixing in both feature space and sequence space"
- Hyperplane arrangement: A partition of input space by linear boundaries induced by ReLU gates in static mixing. "this partition is the usual hyperplane arrangement of a two-layer ReLU map;"
- In-context learning: The ability of models to perform tasks using context without parameter updates, linked to prefix access. "this prefix access supports long-range retrieval and is closely tied to in-context learning behaviors"
- KV cache: Incrementally maintained key/value tensors storing past projections for efficient autoregressive attention. "an incrementally maintained KV cache."
- Lag layout (reverse-offset): Ordering the prefix from newest to oldest to align sequence mixing with autoregressive semantics. "temporal mixing without the reverse-offset (lag) layout collapses."
- Learnable registers: Extra trainable tokens appended to the sequence to expand the hidden pool/capacity. "Learnable registers enlarge the hidden pool"
- Linear attention: Attention variant with linear normalization that removes gating, reading from the full pool. "Linear attention collapses routing/selection"
- LoRA (Low-Rank Adaptation): A fine-tuning method adding low-rank adapters to weight matrices for parameter-efficient updates. "LoRA and gates are most parameter-efficient on the readout side (V/O)"
- Low-rank parameterization: Factorizing large matrices into low-rank components to reduce parameters and control expressivity. "conditioned low-rank parameterizations for feature and sequence mixing"
- L2 normalization (RMSNorm-like): Scaling vectors by their L2 norm (without affine terms) for stabilization instead of softmax. "with L2 normalization (similar to RMSNorm) instead of probability normalization"
- Multi-head attention: Parallel attention heads whose outputs are combined, here as sums of dynamic MLPs. "multi-head attention is simply the sum of nhead parallel dynamic MLPs"
- Probability-simplex constraint: The requirement that attention weights form a probability distribution via softmax. "the probability-simplex constraint may not be essential and can be restrictive:"
- Readout (VO) side: The value/output transformation in attention corresponding to the second layer of the dynamic MLP. "output gating (Qiu et al., 2025) inserts a diagonal gate on the readout side,"
- ReLU attention: Attention variant replacing softmax with normalized ReLU gating over scores. "we refer to the increasingly studied alternative that replaces softmax with a normalized ReLU-style map as ReLU attention"
- Residual connections: Skip connections adding inputs to outputs of blocks to stabilize training and support retrieval. "Stacked with residual connections, this prefix access supports long-range retrieval"
- RoPE (Rotary Positional Embeddings): Position encoding via rotations in feature space to encode relative positions. "With RoPE (Su et al., 2024), the parameterization of W^(1)_MLP(X) is altered by inserting a block-diagonal (per sequence channel of X) rotation,"
- Scaling laws: Empirical relations between model size, data, compute, and performance guiding training regimes. "The empirical "scaling laws" have further pushed models toward larger parameter size training"
- Sequence mixing: Learnable transformations along the sequence dimension to form context-wide slots/bases. "we learn explicit sequence mixing to relax fixed positional coordinates,"
- Softmax attention: The standard attention using softmax-normalized scores as probabilities over positions. "softmax-attention baselines"
- Teacher forcing: Training regime feeding ground-truth tokens during sequence model training over full contexts. "Over length-T teacher forcing, this yields a total O(T^2 r_s) overhead,"
- Toeplitz matrices: Structured matrices representing local convolutions with constant diagonals across offsets. "Local Convolution: Per-d-channel Toeplitz Matrices"
- Update subspace: The fixed low-dimensional subspace limiting how outputs can change when readout rank is small. "restricts the update subspace,"
- Warped routing: Richer, non-polyhedral gating boundaries arising when mixing depends on the current input. "Warped routing strictly generalizes polyhedral routing"