
SimpleGPT Architecture: SimpleNorm Innovation

Updated 4 February 2026
  • SimpleGPT is a GPT-style decoder-only Transformer that applies SimpleNorm after every linear projection to maintain fixed activation norms.
  • SimpleNorm normalizes outputs to invariant scales, linking activation scaling with Hessian spectral norms to allow substantially higher learning rates.
  • Empirical results on models from 1B to 8B parameters show improved training stability and loss reduction compared to PreNorm and QKNorm approaches.

SimpleGPT is a GPT-style decoder-only Transformer architecture distinguished by the systematic application of a novel normalization operator, SimpleNorm, immediately after every linear projection. Designed to address optimization instabilities intrinsic to large-scale Transformer training, SimpleGPT leverages insights from second-order geometry to link architectural choices, activation scaling, Hessian spectral norm, and maximum stable learning rate. Its empirical instantiations demonstrate substantial improvements in training stability and final loss across multiple model scales (1B–8B), and permit learning rates up to an order of magnitude larger than those used with conventional methods such as PreNorm and QKNorm (Chen et al., 1 Feb 2026).

1. Model Structure and Forward Pass

SimpleGPT inherits the classic decoder-only Transformer backbone. Its key architectural aspects are:

  • Token Embeddings: $E\in\mathbb{R}^{|\mathcal V|\times d}$ for vocabulary $\mathcal{V}$. Positional representations include RoPE with base $\theta=10{,}000$ ($d$-dimensional queries; Llama2-based 1B/7B), RoPE with $\theta=500{,}000$ (Llama3-based 8B), and learned positional embeddings for the 1.4B-parameter (nanoGPT-based) model.
  • Transformer Blocks: A stack of $L$ blocks, each comprising attention and feed-forward sublayers. Unlike standard PreNorm approaches, no block-level LayerNorm is applied at the input. Instead, every linear mapping within each sublayer is immediately followed by SimpleNorm.
  • Output Head: A single linear projection $W_\mathrm{out} \in \mathbb{R}^{d \times |\mathcal V|}$, tied or untied to the embedding, producing logits for the cross-entropy loss.

A block’s computation involves:

  • Attention sublayer:

$$q = \mathrm{SN}(h_{\ell-1};W_q,\gamma_q),\quad k = \mathrm{SN}(h_{\ell-1};W_k,\gamma_k),\quad v = \mathrm{SN}(h_{\ell-1};W_v,\gamma_v)$$

$$\mathrm{Attn}(h_{\ell-1}) = \mathrm{SN}\bigl(\mathrm{softmax}(q k^{\top}/\sqrt{d_h})\,v;\ W_o,\gamma_o\bigr)$$

  • Residual:

$$\tilde h_\ell = h_{\ell-1} + \mathrm{Attn}(h_{\ell-1})$$

  • Feed-forward sublayer:

$$m = \mathrm{SN}(\tilde h_\ell; W_1, \gamma_1),\quad m' = \mathrm{Activation}(m),\quad \mathrm{FFN}(\tilde h_\ell) = \mathrm{SN}(m'; W_2, \gamma_2)$$

$$h_\ell = \tilde h_\ell + \mathrm{FFN}(\tilde h_\ell)$$

SimpleNorm (see Section 2) is denoted $\mathrm{SN}(\cdot)$ above.
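The per-block computation above can be sketched in NumPy (a minimal single-head illustration; the function names, the ReLU choice of Activation, and the per-token application of the norm are assumptions, not code from the paper):

```python
import numpy as np

def simple_norm(x, W, gamma):
    """SimpleNorm applied per token: project, unit-normalize, rescale by sqrt(d)*gamma."""
    z = x @ W.T                                        # (T, d_out) projection
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)  # each row has unit norm
    return np.sqrt(W.shape[0]) * gamma * u

def simplegpt_block(h, p):
    """One SimpleGPT block: no input LayerNorm; SN follows every linear map."""
    T, d = h.shape
    # Attention sublayer (single head for brevity).
    q = simple_norm(h, p["Wq"], p["gq"])
    k = simple_norm(h, p["Wk"], p["gk"])
    v = simple_norm(h, p["Wv"], p["gv"])
    scores = q @ k.T / np.sqrt(d) + np.triu(np.full((T, T), -np.inf), k=1)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # causal softmax
    h = h + simple_norm(A @ v, p["Wo"], p["go"])       # residual: h~
    # Feed-forward sublayer with SN after both projections.
    m = np.maximum(simple_norm(h, p["W1"], p["g1"]), 0.0)  # ReLU as Activation
    return h + simple_norm(m, p["W2"], p["g2"])        # residual: h_l
```

Note that the only normalization anywhere in the block sits directly after each weight matrix, mirroring the equations above.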

2. SimpleNorm Operator: Mathematical Formulation

SimpleNorm directly normalizes every linear projection. For an input $x \in \mathbb{R}^m$, weight $W \in \mathbb{R}^{d \times m}$, and learnable scaling $\gamma \in \mathbb{R}^d$:

$$\mathrm{SN}(x;W,\gamma) = \gamma \odot \left( \sqrt{d}\,\frac{Wx}{\|Wx\|_2} \right)$$

  • $\gamma$ is a per-dimension scaling parameter (as in RMSNorm).
  • The $\sqrt{d}$ factor ensures an activation norm $\approx \|\gamma\|_2$.
  • Equivalently:
    • $z = Wx$
    • $s = \|z\|_2$
    • $u = z/s$, with $\|u\|_2 = 1$
    • $D = \operatorname{diag}(\gamma)$
    • $\mathrm{SN}(x;W,\gamma) = \sqrt{d}\,D\,u$

This ensures that all linear projections have outputs with a fixed norm controlled only by $\gamma$, providing invariant scaling regardless of $\|W\|_2$.
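The operator is small enough to state directly in code. A minimal NumPy sketch (`simple_norm` is a hypothetical helper name, not the paper's implementation):

```python
import numpy as np

def simple_norm(x, W, gamma):
    """SN(x; W, gamma) = gamma * sqrt(d) * Wx / ||Wx||_2."""
    z = W @ x                     # z = Wx
    u = z / np.linalg.norm(z)     # unit direction, ||u||_2 = 1
    return np.sqrt(len(z)) * gamma * u
```

With $\gamma = \mathbf{1}$ the output norm is exactly $\sqrt{d}$, and rescaling $W$ leaves the output unchanged, which is the scale invariance claimed above.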

3. Comparison to LayerNorm and QKNorm

Standard PreNorm GPT applies LayerNorm to the block input, then performs the attention and MLP projections. QKNorm adds LayerNorms on $q$ and $k$ prior to the dot product, but the outputs of the remaining projections (e.g. $v$) are still free to scale with $W$.

In contrast, SimpleGPT applies SimpleNorm directly after every linear mapping—attention projections ($W_q$, $W_k$, $W_v$, $W_o$) and feed-forward projections ($W_1$, $W_2$):

  • Guarantees that every normalized projection output has $\ell_2$ norm $\approx \sqrt{d}$.
  • Removes all block-level LayerNorms.
  • Prevents scale drift, explosion, or collapse in intermediate activations.
| Projection | PreNorm | QKNorm | SimpleGPT |
|---|---|---|---|
| Attention $q,k,v$ | LN + $W$ | $W$ + LN ($q,k$) | SN (all) |
| FFN | LN + $W_1$/$W_2$ | LN + $W_1$/$W_2$ | SN ($W_1$ and $W_2$) |
| Output | $W_o$ + LN | $W_o$ + LN | SN ($W_o$) |

The activation norm control at each projection step is unique to SimpleGPT.
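A quick numeric sketch of the contrast (NumPy; `post_norm` is a stand-in for post-projection normalization, not code from the paper):

```python
import numpy as np

def post_norm(z):
    # Stand-in for normalizing a projection output to norm sqrt(d),
    # as SimpleNorm does (gamma omitted for brevity).
    return np.sqrt(len(z)) * z / np.linalg.norm(z)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(32)

# Under QKNorm, q and k are normalized, but v = W_v x still scales
# linearly with the weights:
ratio = np.linalg.norm((10 * W) @ x) / np.linalg.norm(W @ x)  # ten-fold growth

# Under SimpleNorm, every projection output is renormalized, so the
# same ten-fold weight growth is absorbed entirely:
sn_a = post_norm(W @ x)
sn_b = post_norm((10 * W) @ x)  # identical to sn_a
```

This is the per-projection norm control the table summarizes: only SimpleGPT applies it at every row.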

4. Geometric Analysis: Hessian Spectral Norm and Stability

Optimization stability is governed by the largest eigenvalue of the Hessian $H_{xx} = \nabla_x^2 \ell(x)$, which bounds the maximum stable learning rate: $\eta_{\max} \leq 2/\beta$, where $\beta = \sup_x \|H_{xx}\|_2$.

SimpleNorm’s forward and backward pass yields:

$$H_{xx} = J^\top H_{yy} J + C$$

where $J = \frac{\sqrt{d}}{\|Wx\|_2}\, D\, (I - uu^\top)\, W$ and $C$ is a secondary curvature term satisfying $\|C\|_2 \leq \frac{3\kappa^2}{\sqrt{d}}\,\|g_y\|_2$, with $g_y$ the gradient with respect to the layer output. In high dimension, $\|C\|_2$ is negligible relative to the leading term $J^\top H_{yy} J$.

  • The spectral norm of the SimpleNorm Hessian is independent of $\|W\|_2$, making it scale-invariant.
  • For an ordinary linear layer, the Hessian grows as $\|W\|_2^2$ during training, potentially producing large curvature and instability at high learning rates.

The result is that

$$\sup_x \|H_{xx}^{\rm sn}(x)\|_2 \ll \sup_x \|H_{xx}^{\rm lin}(x)\|_2$$

A plausible implication is improved training stability in large models as weight norms increase.
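The scale invariance of the leading term can be checked numerically (a NumPy sketch; `sn_jacobian` is a hypothetical helper implementing the $J$ defined above, with $\gamma = \mathbf{1}$):

```python
import numpy as np

def sn_jacobian(x, W, gamma):
    # J = (sqrt(d)/||Wx||_2) * D (I - u u^T) W, the first-order factor
    # from the Hessian decomposition above.
    z = W @ x
    s = np.linalg.norm(z)
    u = z / s
    d = W.shape[0]
    return (np.sqrt(d) / s) * np.diag(gamma) @ (np.eye(d) - np.outer(u, u)) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(32)
g = np.ones(64)

# Scaling W by 10 grows a plain linear layer's Jacobian (which is W
# itself) ten-fold ...
lin_growth = np.linalg.norm(10 * W, 2) / np.linalg.norm(W, 2)

# ... but leaves the SimpleNorm Jacobian exactly unchanged: the 1/||Wx||
# prefactor cancels the weight scale.
J_a = sn_jacobian(x, W, g)
J_b = sn_jacobian(x, 10 * W, g)
```

The cancellation is exact, not approximate: scaling $W$ by $c$ scales both $W$ and $\|Wx\|_2$ by $c$, leaving $u$ and hence $J$ fixed.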

5. Stable High Learning Rate Regimes

The theoretical reduction in Hessian spectral norms enables SimpleGPT to tolerate larger stable learning rates, empirically observed as:

$$\eta_{\rm sn} \in [3\times,\ 10\times]\;\eta_{\rm standard}$$

This matches the predicted reduction in curvature by $\sim\|W\|_2^2$ for large-scale models, allowing up to an order-of-magnitude gain in learning rate over PreNorm and QKNorm configurations without loss collapse or divergence.
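Read as a back-of-envelope consequence of the $\eta_{\max} \le 2/\beta$ bound from Section 4 (an illustrative reading, not a result stated in the source):

```latex
% For a plain linear layer, curvature grows with the weight norm,
% so a 3x growth in ||W||_2 shrinks the stable learning rate ~9x;
% the SimpleNorm bound carries no such dependence.
\eta_{\max}^{\mathrm{lin}} \;\le\; \frac{2}{\beta^{\mathrm{lin}}}
  \;\propto\; \frac{1}{\|W\|_2^2},
\qquad
\eta_{\max}^{\mathrm{sn}} \;\le\; \frac{2}{\beta^{\mathrm{sn}}},
\quad \beta^{\mathrm{sn}} \text{ independent of } \|W\|_2 .
```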

6. Empirical Configurations and Training Protocol

Empirical validations span four model scales, with the following configurations:

| Model | Origin | Layers $L$ | $d_\mathrm{model}$ | $d_\mathrm{ffn}$ | Heads | Seq × Batch | Steps | LR (SimpleGPT) | Weight Decay |
|---|---|---|---|---|---|---|---|---|---|
| 1B | Llama 2 | 18 | 2,048 | 5,632 | 16 | 512×256 | 200K | $2\times10^{-3}$ (10×) | 0.05 |
| 1.4B | nanoGPT | 48 | 1,536 | 6,144 | 24 | 1024×512 | 100K | $6\times10^{-4}$ (3×) | 0.10 |
| 7B | Llama 2 | 32 | 4,096 | 11,008 | 32 | 2048×192 | 20K/40K/60K | $1\times10^{-3}$ (3×) | 0.05 |
| 8B | Llama 3 | 32 | 4,096 | 14,336 | 32 | 2048×192 | 20K | $3\times10^{-3}$ (3×) | 0.05 |
  • Optimizer: AdamW $(\beta_1 = 0.9,\ \beta_2 = 0.95)$
  • Schedule: cosine LR, bfloat16 precision, A800 GPU hardware
  • Datasets: C4 (Llama2/3), OpenWebText (nanoGPT 1.4B)
  • Overhead: minimal, $\sim 3\%$ additional step time
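As an illustration of the schedule above, a minimal cosine-decay implementation using the 1B row's values (the warmup length is an assumption; the source specifies only the cosine shape and the peak learning rates):

```python
import math

def cosine_lr(step, peak_lr=2e-3, total_steps=200_000, warmup=2_000):
    """Cosine learning-rate decay with linear warmup.

    peak_lr and total_steps mirror the 1B SimpleGPT configuration;
    the warmup length is a conventional choice, not from the paper.
    """
    if step < warmup:
        return peak_lr * step / warmup                    # linear warmup
    t = (step - warmup) / (total_steps - warmup)          # progress in [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))  # decay to 0
```

The schedule peaks at $2\times10^{-3}$ after warmup and decays smoothly to zero at step 200K.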

7. Ablation Studies and Performance Outcomes

Max-LR tolerance, loss improvement, and robustness to hyperparameters are reported:

  • Max-LR Tolerance (1B Llama2):
    • PreNorm diverges at $\eta = 2\times10^{-3}$
    • PreNorm+QKNorm: stable up to $2\times10^{-2}$, diverges at $2\times10^{-1}$
    • SimpleNorm: stable even at $\eta = 2\times10^{-1}$
  • Training Loss Improvements:
| Model & Steps | PreNorm | QKNorm | SimpleGPT | $\Delta$ (SimpleGPT – QKNorm) |
|---|---|---|---|---|
| 1B (200K) | 2.478 | 2.478 | 2.446 | –0.032 |
| 1.4B (100K) | 3.010 | 3.010 | 2.967 | –0.043 |
| 7B (60K) | 2.290 | 2.290 | 2.208 | –0.082 |
| 8B (20K) | 2.320 | 2.320 | 2.240 | –0.080 |
  • Learning-Rate Sweep (1B): SimpleGPT matches or outperforms QKNorm across all learning rates, with the margin increasing at higher rates.
  • Weight Decay Robustness: SimpleGPT's improvement (loss reduction $\sim 0.05$–$0.08$) is sustained as weight decay is varied from $0.1$ to $0.03$ in the 7B and 8B settings.

In summary, per-projection normalization through SimpleNorm ensures stable activation scales and Hessian conditioning, permitting learning rates $3$–$10\times$ larger and yielding reliable reductions in training loss across the 1B–8B parameter scale, with low computational overhead and resilience to hyperparameter perturbation (Chen et al., 1 Feb 2026).

References (1)
