Selecting Which Transformer Layers to Augment with NOBLE
Determine which subset of a transformer block's linear projections (the query, key, and value projections, the attention output projection, and the two feedforward network linear layers) should be augmented with the NOBLE nonlinear low-rank branch during pretraining so as to best trade off training efficiency against model performance.
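To make the selection problem concrete, the sketch below wraps each candidate projection in a NOBLE-style layer. The branch form used here, y = Wx + U·gelu(Vx) with rank r and a zero-initialized up-projection, is an assumption for illustration; the paper's exact branch parameterization, activation, and initialization may differ, and the helper names (`NobleLinear`, `augment_block`) are hypothetical.

```python
import numpy as np

class NobleLinear:
    """Dense linear layer plus an assumed nonlinear low-rank branch:
    y = x @ W.T + gelu(x @ V.T) @ U.T  (rank-r branch, zero-init U)."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # base weight
        self.V = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)   # down-projection
        # Zero-init up-projection so the branch is a no-op at the start of training.
        self.U = np.zeros((d_out, rank))

    @staticmethod
    def _gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def __call__(self, x):
        return x @ self.W.T + self._gelu(x @ self.V.T) @ self.U.T


def augment_block(d_model, d_ff, rank):
    """Wrap all six per-block projections, matching the quoted setup
    (Q, K, V, attention output, and both FFN linears)."""
    layers = {name: NobleLinear(d_model, d_model, rank)
              for name in ("q_proj", "k_proj", "v_proj", "out_proj")}
    layers["ffn_up"] = NobleLinear(d_model, d_ff, rank)
    layers["ffn_down"] = NobleLinear(d_ff, d_model, rank)
    return layers
```

A layer-selection ablation would then restrict `augment_block` to a subset of these six keys and compare pretraining loss per unit of compute across subsets; the quoted setup corresponds to keeping all six.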
References
NOBLE is applied to all linear projections (Q, K, V, output projection, and both FFN layers); we leave ablations on selecting which layers to apply NOBLE to for future work.
— NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
(2603.06492 - Smith, 6 Mar 2026) in Language Model Experiments → Autoregressive LLM Pretraining (Setup)