Selecting Which Transformer Layers to Augment with NOBLE

Determine which subset of transformer linear projections—specifically, the query, key, and value projections, the attention output projection, and both feedforward network linear layers—should be augmented with the NOBLE nonlinear low-rank branch during pretraining to optimize training efficiency and model performance.

Background

The paper introduces NOBLE, a nonlinear low-rank branch added to transformer linear layers, and applies it to all linear projections (Q, K, V, attention output, and both FFN layers) in its LLM experiments.

The authors explicitly note that they have not performed ablations to determine which subset of these layers should be augmented and leave this question for future work, indicating an unresolved design choice with potential performance implications.
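To make the design choice concrete, here is a minimal sketch of the ablation space. The excerpt does not specify NOBLE's exact branch form, so this assumes a common nonlinear low-rank shape, y = Wx + U·φ(Vx) with φ = tanh and a zero-initialized up-projection; the class name `NobleLinear`, the layer names, and the `AUGMENT` set are all illustrative, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

class NobleLinear:
    """A linear map, optionally augmented with a nonlinear low-rank branch.

    Assumed branch form (not confirmed by the excerpt):
        y = W x + U * tanh(V x)
    with rank-r matrices V (down-project) and U (up-project).
    """
    def __init__(self, d_in, d_out, rank=None):
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        self.rank = rank
        if rank is not None:
            self.V = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)
            # Zero-init U so the augmented layer initially matches plain W x.
            self.U = np.zeros((d_out, rank))

    def __call__(self, x):
        y = self.W @ x
        if self.rank is not None:
            y = y + self.U @ np.tanh(self.V @ x)
        return y

# The open question is which subset of these six projections to augment.
# The paper's experiments augment all of them; an ablation would vary this set.
AUGMENT = {"q", "k", "v", "attn_out", "ffn_in", "ffn_out"}  # illustrative names

d_model, d_ff, r = 64, 256, 8
layers = {
    name: NobleLinear(d_in, d_out, r if name in AUGMENT else None)
    for name, (d_in, d_out) in {
        "q": (d_model, d_model),
        "k": (d_model, d_model),
        "v": (d_model, d_model),
        "attn_out": (d_model, d_model),
        "ffn_in": (d_model, d_ff),
        "ffn_out": (d_ff, d_model),
    }.items()
}

x = rng.standard_normal(d_model)
print(layers["q"](x).shape)  # (64,)
```

An ablation study would sweep `AUGMENT` over subsets (e.g., attention-only vs. FFN-only) and compare pretraining loss and throughput; the zero-initialized branch guarantees each configuration starts from the same baseline function.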

References

NOBLE is applied to all linear projections (Q, K, V, output projection, and both FFN layers); we leave ablations on selecting which layers to apply NOBLE to for future work.

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches  (2603.06492 - Smith, 6 Mar 2026) in Language Model Experiments → Autoregressive LLM Pretraining (Setup)