Validate HyperMLP/HyperGLU at frontier LLM scales

Determine whether the HyperMLP and HyperGLU architectures retain their capability and efficiency advantages at very large, practical LLM sizes. Concretely: train and evaluate HyperMLP/HyperGLU at frontier scales and compare against strong softmax-attention Transformer baselines.

Background

The paper reframes autoregressive attention as a dynamic two-layer MLP and proposes HyperMLP and HyperGLU, which add input-conditioned mixing in both feature and sequence dimensions. The authors provide theoretical characterizations and demonstrate empirical gains over softmax attention under matched parameter budgets.
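The exact HyperMLP/HyperGLU construction is not reproduced here, but the underlying reframing can be illustrated with standard causal attention: by associativity, the attention output `A @ (X @ Wv)` equals `(A @ X) @ Wv`, i.e. a two-layer network whose first layer (the mixing matrix `A` over the sequence dimension) is generated from the input, followed by a static feature-dimension layer. A minimal numpy sketch (all shapes and names are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8                                  # toy sequence length and width
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Standard causal softmax attention.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
A = softmax(scores, axis=-1)                 # input-conditioned mixing matrix
out_attention = A @ V

# Same computation, read as a dynamic two-layer network:
# layer 1: input-conditioned mixing across the sequence dimension (A),
# layer 2: a static map across the feature dimension (Wv).
out_dynamic = (A @ X) @ Wv

assert np.allclose(out_attention, out_dynamic)
```

In this view, softmax attention conditions only the sequence-mixing layer on the input; the proposed architectures additionally make the feature-dimension mixing input-conditioned.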

Due to resource constraints, experiments are conducted at moderate scales (e.g., 340M and 1.3B parameters). The authors explicitly state that evaluating the architectures at much larger, practical frontier scales remains future work, leaving open whether the observed advantages persist at those scales.

References

Additionally, due to resource constraints, we do not scale HyperMLP/HyperGLU to very large practical LLM sizes; validating performance at frontier scales is left for future work.

HyperMLP: An Integrated Perspective for Sequence Modeling  (2602.12601 - Lu et al., 13 Feb 2026) in Section 4, Conclusion and Limitations