HyperMLP: An Integrated Perspective for Sequence Modeling
Abstract: Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
Explain it Like I'm 14
What is this paper about?
This paper suggests a simpler way to think about how Transformers process sequences like text. Instead of seeing “attention” as choosing past words using probabilities, the authors show it can be viewed as a tiny two-layer calculator (an MLP) that changes its settings on the fly based on the words you’ve already seen. Using this idea, they build new attention-like blocks called HyperMLP and HyperGLU that make models smarter and more efficient under the same size limits.
Think of it like this: when writing a story word by word, the model keeps a growing “notebook” of past words. Traditional attention treats this notebook like a list it picks from using probabilities. HyperMLP/HyperGLU treat it like a toolbox of switches and sliders that the model learns to rearrange and use depending on the current word and the whole notebook.
What questions are they trying to answer?
The paper asks:
- Can we replace probability-based attention (softmax) with a simpler, more flexible “gating” style and still do as well, or even better?
- Can we make attention more expressive by learning how to mix information across the sequence itself, not just across features?
- How do design choices in attention (like where to place gates or how much to compress ranks) actually affect a model’s abilities?
- Do these new blocks beat strong baselines in practice when the number of parameters is kept the same?
How did they approach the problem?
The authors reframe “self-attention” as a dynamic two-layer MLP:
- An MLP (multi-layer perceptron) is a basic neural network with layers that apply linear transforms and simple activations like ReLU. ReLU is like a switch: it keeps positive numbers and zeroes out negatives.
- In standard attention, the model computes “scores” between the current word and each past word, then uses softmax to turn those scores into probabilities and averages the past information.
- In their view, those scores are not probabilities; they are the hidden layer of a two-layer MLP whose “weights” are created from the context (the history of words). The model uses gates (ReLU or GLU) to decide which parts of the context to activate, like turning on some switches based on the input.
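This reframing can be checked numerically. Below is a minimal sketch (our illustration, not the paper's code), with arbitrary sizes, illustrative projection names `Wk`/`Wv`, and the query projection omitted for brevity: reading a causal attention head at the last step is literally a two-layer MLP whose weights are instantiated from the context, with keys as the first layer, values as the second, and softmax playing the role of the activation.

```python
import numpy as np

# Context-instantiated "weights": K is the first-layer matrix, V.T the second.
rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.standard_normal((T, d))      # context tokens x_1..x_T
Wk = rng.standard_normal((d, d))     # illustrative key projection
Wv = rng.standard_normal((d, d))     # illustrative value projection

K, V = X @ Wk.T, X @ Wv.T
q = X[-1]                            # current token (query projection omitted)

# View 1: standard softmax attention at the last step.
scores = K @ q / np.sqrt(d)
p = np.exp(scores - scores.max())
p = p / p.sum()
attn_out = p @ V                     # probability-weighted read of the values

# View 2: the same computation as a two-layer MLP with hidden layer
# h = activation(W1 @ q) and output W2 @ h, where W1 = K/sqrt(d),
# W2 = V.T, and the "activation" happens to be softmax.
z = K @ q / np.sqrt(d)
h = np.exp(z - z.max())
h = h / h.sum()
mlp_out = V.T @ h

assert np.allclose(attn_out, mlp_out)
```

The paper's point is that once attention is written this way, the softmax "activation" is just one choice among many, and gates like ReLU or GLU become natural drop-in alternatives.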
They then design two new blocks:
- HyperMLP:
- Learns to mix information across both feature space (which describes what a word is like) and sequence space (which describes where it is in the history), using efficient low-rank factors. “Low-rank” means they keep only the most important directions, like summarizing a big photo into a small but useful version.
- Uses a “lag” layout: it treats the history in reverse order (newest to oldest). This keeps behavior consistent as the sequence grows, which matches how models generate text one token at a time.
- Normalizes with simple L2 scaling (like adjusting volume so the signal doesn’t blow up) rather than probability normalization.
- HyperGLU:
- Similar to HyperMLP, but replaces the ReLU with a GLU, which splits the hidden signal into two parts: one decides “which slots to use” (selection), and the other controls “how strongly to use them” (scale). Imagine one knob choosing the tools and a second knob setting their power.
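The three hidden-layer treatments can be contrasted on a single vector of scores. All numbers and the names `sel`/`mag` below are made up for illustration:

```python
import numpy as np

scores = np.array([1.2, -0.7, 0.3, -2.0, 0.5, -0.1])

# Softmax: a probability distribution -- every slot gets nonzero weight.
p = np.exp(scores) / np.exp(scores).sum()

# HyperMLP-style: ReLU switches off negative slots, then L2 rescaling
# controls the overall "volume" instead of forcing weights to sum to 1.
h = np.maximum(scores, 0.0)
h = h / (np.linalg.norm(h) + 1e-6)

# HyperGLU-style: one half of the hidden signal selects slots (ReLU),
# the other half independently sets their magnitude (and can flip sign).
sel = np.array([0.5, -1.0, 2.0, -0.2, 0.0, 1.5])
mag = np.array([2.0, 3.0, -0.5, 1.0, 4.0, 0.1])
g = np.maximum(sel, 0.0) * mag

assert np.isclose(p.sum(), 1.0) and np.all(p > 0)  # softmax: all slots active
assert np.count_nonzero(h) == 3                    # ReLU: hard selection
assert g[1] == 0.0 and g[3] == 0.0                 # GLU: unselected slots stay off
```

The key qualitative difference: softmax must spread some weight everywhere, while the gated variants make a hard selection and handle magnitude separately.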
Everyday analogies for technical terms:
- Autoregressive: writing text one word at a time, only seeing what you already wrote.
- Attention scores: a set of numbers measuring relevance; here treated as switch settings, not probabilities.
- Sequence mixing: like rearranging and blending your notebook’s pages to create new helpful summaries before reading from them.
- Low-rank and diagonal-plus-low-rank: efficient ways to mix information without handling huge matrices, like using a small set of levers rather than a giant control panel.
- KV cache: a memory of past processed features the model can quickly access.
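The “small set of levers” idea behind diagonal-plus-low-rank (DPLR) mixing can be sketched in a few lines. The sizes and rank below are arbitrary choices for exposition:

```python
import numpy as np

# A T x T sequence mixer stored as a diagonal plus a rank-r update,
# applied in O(T*r) work instead of O(T^2).
rng = np.random.default_rng(0)
T, r = 8, 2
d_diag = rng.standard_normal(T)      # the diagonal "levers"
U = rng.standard_normal((T, r))      # low-rank factors
W = rng.standard_normal((T, r))

x = rng.standard_normal(T)           # one channel of the sequence

# Apply M = diag(d_diag) + U @ W.T without ever materializing M:
y_fast = d_diag * x + U @ (W.T @ x)

# Dense reference (the "giant control panel") agrees:
M = np.diag(d_diag) + U @ W.T
assert np.allclose(M @ x, y_fast)
```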
They test their ideas with:
- Controlled design studies on a diagnostic benchmark (MAD) and a small GPT-style trainer (NanoGPT), carefully matching parameter budgets to isolate effects.
- Full language-model training at two scales (340M and 1.3B parameters) on large datasets, evaluated on many public benchmarks.
What did they find, and why does it matter?
Key findings:
- Softmax probabilities aren’t essential. Replacing softmax with ReLU-style gates often matches or beats softmax under the same budget when paired with good feature handling.
- Mixing along the sequence dimension is a big win. Learning how to combine information across past positions gives strong improvements in tasks that need retrieval and copying, like finding and reusing earlier content.
- The “lag” layout (reverse order) is necessary. Without it, temporal mixing breaks the autoregressive logic and performance crashes. With it, results are much better and consistent as the sequence length changes.
- Mixing on both sides helps. Applying sequence mixing on the “routing” side (which slots are active) and the “readout” side (how to combine them) is better than mixing on just one.
- Budget matters: keep more rank on the readout side. Compressing the readout side hurts update diversity more than compressing the routing side. This explains why some popular fine-tuning tricks work better on certain projections.
- HyperGLU tends to be the best variant. Splitting selection from magnitude makes models more robust and leads to the strongest overall results in their tests.
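Why the lag layout matters can be shown with a toy fixed kernel over the last three positions (our construction, not the paper's mixer): indexing by lag keeps the current read stable as the history grows, while absolute (front-aligned) indexing does not.

```python
w = [0.6, 0.3, 0.1]                      # weight on lag 0 (newest), 1, 2

def mix_lag(history):
    recent = history[::-1][:3]           # reverse-offset: newest first
    return sum(wi * xi for wi, xi in zip(w, recent))

def mix_absolute(history):
    return sum(wi * xi for wi, xi in zip(w, history))  # front-aligned

short = [1.0, 2.0, 3.0]
longer = [9.0, 9.0, 1.0, 2.0, 3.0]       # same recent past, older prefix added

# Lag layout: extending the far past does not change the current read.
assert mix_lag(short) == mix_lag(longer)
# Absolute indexing: the same extension shifts everything and breaks it.
assert mix_absolute(short) != mix_absolute(longer)
```

This is the truncation-invariance property the paper relies on: with the lag layout, appending older context leaves the newest-step computation untouched, matching autoregressive semantics.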
In numbers (high-level takeaways):
- On diagnostic tasks (MAD), adding sequence mixing with lag layout boosts performance a lot, especially on retrieval/copy tasks.
- In NanoGPT training, models with HyperMLP/HyperGLU reach lower losses faster under the same training setup.
- On broader language-model evaluations, HyperMLP and especially HyperGLU consistently rank higher than strong softmax-attention and other modern sequence models at matched sizes.
What does this mean going forward?
Implications:
- Rethinking attention as a dynamic two-layer MLP simplifies design and explains many empirical tips the community discovered by trial and error. It suggests focusing on learned gating and sequence mixing rather than strictly preserving probability distributions.
- Under the same parameter and compute budget, you can get more capability by learning how to rearrange and gate the sequence memory. This is promising for building smaller, stronger, and more efficient models.
- Engineering challenges remain. HyperMLP/HyperGLU introduce new sequence mixing steps that don’t yet plug into highly optimized kernels like FlashAttention, so matching raw speed needs further work.
- Scaling up is the next step. The paper shows consistent gains at moderate sizes; testing at very large “frontier” scales could confirm the value of this approach across domains like language, vision, and speech.
In short: the paper turns attention into a flexible, learnable calculator that selects and mixes context in smarter ways. That shift delivers better performance without increasing model size, opening a practical path to more capable and efficient sequence models.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide follow‑up research.
- Scaling to frontier sizes: Validate HyperMLP/HyperGLU at tens to hundreds of billions of parameters and ≥1T tokens, including convergence behavior, training stability, memory footprint, and whether gains persist or change with scale.
- Kernel-level optimization: Design and benchmark fused GPU kernels for sequence mixing (e.g., FlashAttention-like implementations for DPLR mixers), quantify wall-clock speed, throughput, and memory vs. highly optimized attention.
- Long-context performance: Empirically assess extension-consistency claims beyond 4K tokens (e.g., 32K–128K). Test on long-context benchmarks (LRA variants, long-document QA, needle-in-a-haystack, retrieval-heavy tasks) and report degradation curves.
- Sequence-mixing rank selection: Systematically study how the DPLR rank r_s affects expressivity, training dynamics, and compute across layers/heads; explore adaptive or data-driven rank schedules and per-layer heterogeneity.
- Number of heads: The paper fixes nhead = 2 for efficiency. Characterize performance, stability, and budget trade-offs across typical head counts (e.g., 8–32), including head specialization/diversity and interactions with sequence mixing.
- Activation and gating choices: Beyond ReLU/GLU, evaluate SiLU/GELU/leaky-ReLU and alternative GLU variants; isolate their effects on routing geometry, optimization, and generalization under identical budgets.
- Normalization strategy: L2Norm over the hidden vector is assumed as a scalar rescaling. Compare RMSNorm/LayerNorm/affine-free variants applied pre/post mixing; analyze training stability, gradient conditioning, and empirical routing invariance to normalization changes.
- Optimization dynamics: Provide a deeper analysis of gradient flow, gate saturation, margin dynamics, and curvature induced by x_t-dependent mixers; identify failure modes (e.g., collapse without lag layout) and mitigation strategies.
- KV-cache integration: Precisely characterize incremental decoding with DPLR mixing—cache states, update rules, memory growth, and latency. Benchmark streaming inference (batch size, context growth) and caching strategies across long sequences.
- Modal transferability: Test HyperMLP/HyperGLU on vision (ViT), speech, reinforcement learning (Decision Transformer), and multi-modal tasks to verify the claimed unified perspective beyond language.
- Benchmark breadth and robustness: Expand evaluation to coding (HumanEval/MBPP), math reasoning (GSM8K, ASDiv), instruction-following (AlpacaEval, MT-Bench), multilingual tasks, and out-of-distribution robustness; report statistical significance across seeds.
- Energy and compute profiling: Quantify energy usage, FLOPs, memory bandwidth, and training/inference cost per token compared to softmax and other selective state-space models; assess carbon footprint and efficiency trade-offs.
- Mechanistic interpretability: Operationalize the pool/activated-memory view—extract active slot sets, measure routing margins, visualize curved decision boundaries, and relate them to known transformer circuits (induction heads, copy mechanisms).
- Formal expressivity guarantees: Provide universality or approximation theorems for dynamic two-layer MLPs with bounded feature and sequence ranks; bound function classes vs. softmax attention and linear attention under matched budgets.
- Registers in practice: The paper argues registers enlarge the hidden pool; empirically test register count/placement and quantify gains vs. parameter cost, including interactions with DPLR mixing.
- Budget asymmetry under fine-tuning: Extend QK vs. VO compression insights to parameter-efficient tuning (LoRA, adapters) and task-specific specialization; establish guidelines for where to place low-rank adapters and gates in HyperMLP/GLU.
- Alternative sequence mixing designs: Compare DPLR with Toeplitz/conv kernels, low-discrepancy permutations, orthogonal cores (e.g., Lie-group parameterizations), and learned pooling/subsampling; analyze efficiency–expressivity trade-offs.
- Positional semantics: Evaluate interactions with RoPE/relative position encodings and alternatives (ALiBi, T5-style). Determine when learned sequence mixing can replace or complement explicit positional encodings.
- Subquadratic variants: Investigate whether sequence mixing can be structured to enable subquadratic training/inference (e.g., sparse/DPLR+convolution hybrids, low-rank plus local windows) while preserving HyperMLP expressivity.
- Quantization and pruning: Study compatibility with 8-bit/4-bit training/inference and structured sparsity; measure how gates and sequence mixing affect quantization error and pruning sensitivity.
- Transfer and continual learning: Test how context-instantiated weights adapt under domain shift and incremental data; analyze catastrophic forgetting and whether sequence mixing provides more robust in-context adaptation.
- Layerwise design: Explore how mixing ranks, activations, and normalization should vary by depth; identify whether early vs. late layers benefit differently from sequence mixing and GLU routing.
- Assumption sensitivity: Several proofs rely on scalar L2 normalization and padding invariance; probe empirical sensitivity to deviations (e.g., non-scalar norms, variable-length padding) and identify conditions needed for lag-layout invariance.
- Data boundaries and segmentation: Examine how lag layout and sequence mixing behave across document boundaries, resets, and concatenated contexts; propose methods for boundary-aware mixing or slot masking.
- Multi-head coordination: Investigate whether heads learn complementary sequence bases; develop diagnostics and regularizers for head diversity and cross-head routing interference.
Practical Applications
Immediate Applications
Below are concrete, deployable uses that leverage HyperMLP/HyperGLU’s dynamic two‑layer MLP view of attention, learned sequence mixing (DPLR), and GLU/ReLU routing under matched parameter budgets.
- Drop‑in replacement heads for small–mid‑scale Transformers to boost retrieval/copy behaviors
- Sectors: software, education, content generation
- What: Replace softmax attention with HyperGLU/HyperMLP in autoregressive decoders for better in‑context recall, selective copy, and long‑range retrieval (as shown by MAD and LM benchmarks).
- Tools/workflows: PyTorch layer modules for “HyperGLUAttention,” config flag in training scripts; evaluation with MAD and Open LLM Leaderboard suites.
- Assumptions/dependencies: Current kernels are not as optimized as FlashAttention; best suited where slight latency overhead is acceptable; recommended nhead≈2 and lag (reverse‑offset) layout.
- Fine‑tuning recipes that reallocate rank and adapters per theory
- Sectors: software, enterprise AI
- What: Apply “budget asymmetry” guidance—compress QK, preserve VO rank; attach LoRA/gates on VO side for parameter‑efficient adaptation; expect more update directions per parameter.
- Tools/workflows: PEFT plug‑ins (“VO‑LoRA”, VO‑side gating), training checklists codifying Proposition 2.6 and Theorem 2.5.
- Assumptions/dependencies: Autoregressive blocks; stable with L2 normalization of scores; monitor update‑subspace coverage.
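The rank intuition behind this recipe can be sketched: a LoRA-style update of rank r can only move a layer's outputs within an r-dimensional subspace, which is why preserving rank on the readout (V/O) side matters. Shapes and names below are illustrative, not the paper's code:

```python
import numpy as np

# LoRA-style update delta_W = B @ A with rank r on a d x d projection.
rng = np.random.default_rng(0)
d, r = 16, 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                      # parameter-efficient update on V/O

# However inputs vary, the induced output changes live in the column
# space of B: at most r independent "update directions".
assert delta_W.shape == (d, d)
assert np.linalg.matrix_rank(delta_W) <= r
```

Per the paper's budget-asymmetry argument, compressing this rank on the readout side shrinks the update subspace directly, whereas compressing QK is comparatively forgiving.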
- Retrieval‑heavy assistants and code completion with stronger internal copy
- Sectors: software development, productivity
- What: Improve selective copy and fuzzy recall in IDE copilots, document assistants, meeting summarizers without changing RAG backends.
- Tools/workflows: Swap decoder attention blocks; unit tests focused on copy/recall; prompt templates that exploit longer prefixes.
- Assumptions/dependencies: Throughput may dip without fused kernels; evaluate under your latency SLOs.
- Time‑series forecasting and anomaly detection with lag‑consistent mixing
- Sectors: retail (demand), AIOps (logs), energy (load), finance (signals)
- What: Use HyperMLP’s lag layout and sequence mixing to better combine nearby lags and long‑range patterns in AR forecasters.
- Tools/workflows: Torch forecasting stacks (e.g., PyTorch Forecasting) with HyperGLU decoders; backtesting pipelines.
- Assumptions/dependencies: AR formulation; choose DPLR rank r_s and convolution options; validate truncation invariance under sliding windows.
- Speech and translation decoders with improved AR routing
- Sectors: speech, localization
- What: Replace decoder self‑attention to sharpen long‑context alignment and robustness in streaming ASR and NMT.
- Tools/workflows: Integrate in ESPnet/Fairseq decoders; measure WER/BLEU at matched parameters.
- Assumptions/dependencies: Kernel efficiency may limit streaming latency on edge; start in server/batch or research settings.
- Decision Transformer pipelines with better long‑range credit assignment
- Sectors: robotics, reinforcement learning
- What: Use HyperGLU decoders for sequence‑modeled returns/actions to improve retrieval of distant returns and sub‑trajectory patterns.
- Tools/workflows: Gym/DMControl experiments; ablations on GLU vs ReLU routing.
- Assumptions/dependencies: Sequential policy inference; offline RL datasets with long horizons.
- Interpretability and eval workflows that avoid “attention as probability” pitfalls
- Sectors: safety, compliance, academia
- What: Replace attention‑map explanations with “activated slot tracing” (trace the positive coordinates of h_t), consistent with dynamic MLP routing.
- Tools/workflows: Logging hooks that record active slots and pool atoms; MAD‑style diagnostics for selection/copy.
- Assumptions/dependencies: Reframe governance docs—attention weights are gates, not calibrated probabilities.
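A minimal version of such a logging hook, assuming the gated hidden vector h_t is available at each decoding step (the function name and threshold are our own choices, not the paper's API):

```python
# "Activated slot tracing": record which hidden coordinates the gate
# turned on at this step, rather than rendering scores as a probability
# heatmap.
def active_slots(h_t, eps=0.0):
    """Indices of slots with positive gate value at this decoding step."""
    return [i for i, v in enumerate(h_t) if v > eps]

h_t = [0.8, 0.0, -0.3, 1.2, 0.0, 0.05]  # gated hidden vector at one step
assert active_slots(h_t) == [0, 3, 5]
```

Logging these index sets over time yields routing traces that respect the gate semantics: a slot is either recruited or not, and its weight is a gain, not a calibrated probability.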
- Curriculum and methodology for teaching/analysis of attention as dynamic MLP
- Sectors: academia, training
- What: Use the three‑stage memory view, lag layout, and budget asymmetry to teach/diagnose Transformer behavior.
- Tools/workflows: Lab assignments reproducing MAD gains; visualization of warped routing vs polyhedral partitions.
- Assumptions/dependencies: Autoregressive context; small compute sufficient for didactic demos.
- Batch/offline inference where quality outweighs latency
- Sectors: data labeling, offline analytics
- What: Run HyperGLU models for offline summarization, labeling, or data augmentation to maximize quality per parameter.
- Tools/workflows: Batch generation pipelines; quality checkpoints at equal token budgets.
- Assumptions/dependencies: No real‑time constraints; throughput costs acceptable pending kernel optimizations.
Long‑Term Applications
These require further research, scaling, systems work, or ecosystem adoption before wide deployment.
- Fused kernels and compiler support for DPLR sequence mixing
- Sectors: software infrastructure, semiconductors
- What: FlashAttention‑grade kernels (“FlashHyper”) that fuse L2Norm, GLU gating, and DPLR mixing; scheduler support in Triton/XLA.
- Tools/products: CUDA/Triton kernels; ONNX opset extensions; TVM autotuning.
- Assumptions/dependencies: Cross‑vendor GPU/NPU support; numerical stability and memory‑bandwidth tuning.
- Frontier‑scale LLM pretraining with HyperGLU
- Sectors: cloud AI, foundation models
- What: Train multi‑billion parameter HyperGLU models on 0.5–2T tokens to validate scaling, long‑context ICL, and benchmark leadership.
- Tools/workflows: Megatron‑LM/DeepSpeed integrations; context length ≥32k; long‑context evals.
- Assumptions/dependencies: Significant compute; mixture‑of‑experts compatibility; kernel maturity to maintain throughput.
- On‑device real‑time assistants via capability‑per‑parameter gains
- Sectors: mobile, embedded, IoT
- What: Combine smaller HyperGLU models with mobile NPUs to deliver higher quality at the same memory footprint.
- Tools/products: Android NNAPI/Core ML delegate for HyperGLU ops; quantization/weight sharing tuned for GLU routes.
- Assumptions/dependencies: Hardware kernels for DPLR; energy‑aware scheduling; further latency reductions.
- Multimodal HyperGLU for vision, audio, and video
- Sectors: vision (ViT/segmentation), media understanding
- What: Apply learned sequence mixing across patches/frames to improve retrieval of distant frames and context‑wide slots.
- Tools/workflows: ViT backbones with HyperGLU token mixers; audiovisual transformers in ASR/video QA.
- Assumptions/dependencies: Adapt lag semantics to bidirectional or relative layouts; new positional schemes.
- Memory‑augmented and long‑context systems without heavy external RAG
- Sectors: enterprise search, legal, scientific analysis
- What: Use “context‑wide slot” mixing to internalize retrieval within long prompts, reducing dependence on external indices.
- Tools/workflows: Chunking strategies that align with lag layout; dynamic slot introspection for cache management.
- Assumptions/dependencies: Efficient long‑context kernels; memory‑safe caching; evaluation on >128k tokens.
- Robotics/control stacks with online AR consistency
- Sectors: robotics, industrial automation
- What: Controllers using HyperGLU for better sequence routing over lagged sensor histories; improved stability with extension consistency.
- Tools/workflows: ROS2 modules; sim‑to‑real pipelines with Decision Transformer variants.
- Assumptions/dependencies: Deterministic low‑latency kernels; safety/verification of warped routing.
- Healthcare sequential modeling (EHR trajectories, ICU)
- Sectors: healthcare
- What: Predict outcomes by mixing temporal events (labs, meds) with learned lag‑aware slots; better utilization of long histories.
- Tools/workflows: De‑identified EHR pipelines; conformal risk controls; drift monitors.
- Assumptions/dependencies: Regulatory compliance; interpretability tooling tailored to activated‑slot view; domain validation.
- Financial forecasting and execution
- Sectors: finance
- What: Apply lag‑consistent temporal mixing to price/volume microstructure and event sequences for improved signal extraction.
- Tools/workflows: Backtesting with walk‑forward validation; risk controls on non‑stationarity.
- Assumptions/dependencies: Strict latency budgets in HFT require kernels; robust regularization to avoid overfitting.
- Distillation and compression strategies exploiting routing/readout asymmetry
- Sectors: model compression, edge AI
- What: Distill softmax‑attention teachers into HyperGLU students that keep VO rank while compressing QK; new pruning that preserves readout subspace.
- Tools/workflows: Head‑wise rank scheduling; subspace‑aware pruning; KD losses targeting active‑slot sets.
- Assumptions/dependencies: Teacher‑student infrastructure; careful stability tuning of GLU gates.
- Standards and policy guidance on explainability and energy
- Sectors: policy, governance
- What: Update explainability frameworks to avoid treating attention scores as probabilities; promote evaluation based on active‑slot routing; encourage capability‑per‑token metrics for greener AI.
- Tools/workflows: Audit checklists; reporting formats for routing diagnostics; procurement guidance prioritizing smaller, more capable models.
- Assumptions/dependencies: Community consensus; regulator education; availability of standardized routing metrics.
Notes on feasibility and adoption
- Best near‑term wins appear in research and smaller production systems where a modest throughput penalty is acceptable in exchange for quality gains per parameter.
- Broad deployment hinges on systems work: fused kernels, memory‑efficient caching for DPLR mixing, and hardware support.
- The approach is naturally aligned with autoregressive settings; bidirectional encoders may require adapted layouts and training objectives.
Glossary
- Activated set (routing): The subset of hidden coordinates selected by activation based on score signs during dynamic gating. "Routing is the active-set partition induced by the sign pattern of h_t."
- Autoregressive attention head: A single attention mechanism applied causally to past tokens, here reframed as a dynamic two-layer MLP. "an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history."
- Autoregressive generation (AR): Sequence modeling where each output depends on previous outputs; training/inference respects causal order. "In autoregressive (AR) generation, attention enables parallel training and efficient inference:"
- Autoregressive truncation invariance: The property that extending the far past does not change current outputs under a lag layout. "Lag layout: extension consistency implies AR truncation invariance"
- Block-diagonal rotation (RoPE core): A structured feature-space transformation applying independent rotations per sequence channel. "inserting a block-diagonal (per sequence channel of X) rotation,"
- Budget asymmetry (QK vs. VO): Unequal impact of allocating rank/parameters between first-layer (QK) and second-layer (VO) components. "Budget asymmetry in residual two-layer blocks"
- Content-addressable lookup: Attention interpretation where queries retrieve values by matching against keys and weighting results. "a content-addressable lookup in which a query matches against keys, softmax converts the resulting scores into a distribution over positions, and the output is an expectation-style read"
- Depthwise convolution (KV-side): Channel-wise convolution applied to key/value projections to mix local temporal information. "we still use depthwise convolution in the default HyperMLP/HyperGLU,"
- Diagonal-plus-low-rank (DPLR): A matrix parameterization combining a learned diagonal with a low-rank update for efficient sequence mixing. "Both are input-conditioned with low-rank or diagonal-plus-low-rank (DPLR) forms:"
- Dynamic two-layer MLP: A two-layer perceptron whose effective weights depend on the current context, here aligning with attention. "(i) Dynamic two-layer MLP."
- Fast-weight programming: Approaches that generate weights on the fly conditioned on recent inputs, related to dynamic maps. "classic fast-weight programming (Schmidhuber, 1992; Schlag et al., 2021)"
- FlashAttention: An IO-aware, fused attention kernel for speed and memory efficiency; not directly reusable here. "cannot directly reuse existing efficient backends such as FlashAttention (Dao et al., 2022)."
- GEMM operations: General matrix-matrix multiplications used for efficient batched linear algebra in implementations. "leverages efficient GEMM operations after compilation."
- GLU (Gated Linear Unit): An activation that gates and modulates magnitudes, separating selection from scaling. "HyperGLU replaces the ReLU hidden activation by a GLU-style modulation."
- HyperGLU: The proposed attention variant using GLU-style routing with learned sequence/feature mixing. "HyperGLU replaces the ReLU hidden activation by a GLU-style modulation."
- HyperMLP: The proposed attention-as-MLP architecture with learned input-conditioned sequence/feature mixing. "we propose HyperMLP, which learns input-conditioned mixing in both feature space and sequence space"
- Hyperplane arrangement: A partition of input space by linear boundaries induced by ReLU gates in static mixing. "this partition is the usual hyperplane arrangement of a two-layer ReLU map;"
- In-context learning: The ability of models to perform tasks using context without parameter updates, linked to prefix access. "this prefix access supports long-range retrieval and is closely tied to in-context learning behaviors"
- KV cache: Incrementally maintained key/value tensors storing past projections for efficient autoregressive attention. "an incrementally maintained KV cache."
- Lag layout (reverse-offset): Ordering the prefix from newest to oldest to align sequence mixing with autoregressive semantics. "temporal mixing without the reverse-offset (lag) layout collapses."
- Learnable registers: Extra trainable tokens appended to the sequence to expand the hidden pool/capacity. "Learnable registers enlarge the hidden pool"
- Linear attention: Attention variant with linear normalization that removes gating, reading from the full pool. "Linear attention collapses routing/selection"
- LoRA (Low-Rank Adaptation): A fine-tuning method adding low-rank adapters to weight matrices for parameter-efficient updates. "LoRA and gates are most parameter-efficient on the readout side (V/O)"
- Low-rank parameterization: Factorizing large matrices into low-rank components to reduce parameters and control expressivity. "conditioned low-rank parameterizations for feature and sequence mixing"
- L2 normalization (RMSNorm-like): Scaling vectors by their L2 norm (without affine terms) for stabilization instead of softmax. "with L2 normalization (similar to RMSNorm) instead of probability normalization"
- Multi-head attention: Parallel attention heads whose outputs are combined, here as sums of dynamic MLPs. "multi-head attention is simply the sum of nhead parallel dynamic MLPs"
- Probability-simplex constraint: The requirement that attention weights form a probability distribution via softmax. "the probability-simplex constraint may not be essential and can be restrictive:"
- Readout (VO) side: The value/output transformation in attention corresponding to the second layer of the dynamic MLP. "output gating (Qiu et al., 2025) inserts a diagonal gate on the readout side,"
- ReLU attention: Attention variant replacing softmax with normalized ReLU gating over scores. "we refer to the increasingly studied alternative that replaces softmax with a normalized ReLU-style map as ReLU attention"
- Residual connections: Skip connections adding inputs to outputs of blocks to stabilize training and support retrieval. "Stacked with residual connections, this prefix access supports long-range retrieval"
- RoPE (Rotary Positional Embeddings): Position encoding via rotations in feature space to encode relative positions. "With RoPE (Su et al., 2024), the parameterization of W^(1)_MLP(X) is altered by inserting a block-diagonal (per sequence channel of X) rotation,"
- Scaling laws: Empirical relations between model size, data, compute, and performance guiding training regimes. "The empirical "scaling laws" have further pushed models toward larger parameter size training"
- Sequence mixing: Learnable transformations along the sequence dimension to form context-wide slots/bases. "we learn explicit sequence mixing to relax fixed positional coordinates,"
- Softmax attention: The standard attention using softmax-normalized scores as probabilities over positions. "softmax-attention baselines"
- Teacher forcing: Training regime feeding ground-truth tokens during sequence model training over full contexts. "Over length-T teacher forcing, this yields a total O(T^2 r_s) overhead,"
- Toeplitz matrices: Structured matrices representing local convolutions with constant diagonals across offsets. "Local Convolution: Per-d-channel Toeplitz Matrices"
- Update subspace: The fixed low-dimensional subspace limiting how outputs can change when readout rank is small. "restricts the update subspace,"
- Warped routing: Richer, non-polyhedral gating boundaries arising when mixing depends on the current input. "Warped routing strictly generalizes polyhedral routing"