
SimpleGPT Architecture: SimpleNorm Innovation

Updated 4 February 2026
  • SimpleGPT is a GPT-style decoder-only Transformer that applies SimpleNorm after every linear projection to maintain fixed activation norms.
  • SimpleNorm normalizes outputs to invariant scales, linking activation scaling with Hessian spectral norms to allow substantially higher learning rates.
  • Empirical results on models from 1B to 8B parameters show improved training stability and loss reduction compared to PreNorm and QKNorm approaches.

SimpleGPT is a GPT-style decoder-only Transformer architecture distinguished by the systematic application of a novel normalization operator, SimpleNorm, immediately after every linear projection. Designed to address optimization instabilities intrinsic to large-scale Transformer training, SimpleGPT leverages insights from second-order geometry to link architectural choices, activation scaling, Hessian spectral norm, and maximum stable learning rate. Its empirical instantiations demonstrate substantial improvements in training stability and final loss across multiple model scales (1B–8B), and permit learning rates up to an order of magnitude larger than those used with conventional methods such as PreNorm and QKNorm (Chen et al., 1 Feb 2026).

1. Model Structure and Forward Pass

SimpleGPT inherits the classic decoder-only Transformer backbone. Its key architectural aspects are:

  • Token Embeddings: $E\in\mathbb{R}^{|\mathcal V|\times d}$ for vocabulary $\mathcal{V}$. Positional representations include RoPE with base $\theta=10{,}000$ ($d$-dimensional queries; Llama2-based 1B/7B), RoPE with $\theta=500{,}000$ (Llama3-based 8B), and learned positional embeddings for the 1.4B-parameter (nanoGPT-based) model.
  • Transformer Blocks: A stack of $L$ blocks, each comprising attention and feed-forward sublayers. Unlike standard PreNorm approaches, no block-level LayerNorm is applied at the input. Instead, every linear mapping within each sublayer is immediately followed by SimpleNorm.
  • Output Head: A single linear projection $W_\mathrm{out} \in \mathbb{R}^{d \times |\mathcal V|}$, tied or untied to the embedding, producing logits for the cross-entropy loss.

A block’s computation involves:

  • Attention sublayer:

$$q = \mathrm{SN}(h_{\ell-1};W_q,\gamma_q),\quad k = \mathrm{SN}(h_{\ell-1};W_k,\gamma_k),\quad v = \mathrm{SN}(h_{\ell-1};W_v,\gamma_v)$$

$$\mathrm{Attn}(h_{\ell-1}) = \mathrm{SN}\bigl(\mathrm{softmax}(q k^{\top}/\sqrt{d_h})\,v;\ W_o,\gamma_o\bigr)$$

  • Residual:

$$\tilde h_\ell = h_{\ell-1} + \mathrm{Attn}(h_{\ell-1})$$

  • Feed-forward sublayer:

$$m = \mathrm{SN}(\tilde h_\ell; W_1, \gamma_1),\quad m' = \mathrm{Activation}(m),\quad \mathrm{FFN}(\tilde h_\ell) = \mathrm{SN}(m'; W_2, \gamma_2)$$

$$h_\ell = \tilde h_\ell + \mathrm{FFN}(\tilde h_\ell)$$

SimpleNorm (see Section 2) is denoted $\mathrm{SN}(\cdot)$ above.
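The per-block computation above can be sketched in NumPy (a minimal single-head illustration; the function names, the ReLU choice of Activation, and the per-token application of the norm are assumptions, not code from the paper):

```python
import numpy as np

def simple_norm(x, W, gamma):
    """SimpleNorm applied per token: project, unit-normalize, rescale by sqrt(d)*gamma."""
    z = x @ W.T                                        # (T, d_out) projection
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)  # each row has unit norm
    return np.sqrt(W.shape[0]) * gamma * u

def simplegpt_block(h, p):
    """One SimpleGPT block: no input LayerNorm; SN follows every linear map."""
    T, d = h.shape
    # Attention sublayer (single head for brevity).
    q = simple_norm(h, p["Wq"], p["gq"])
    k = simple_norm(h, p["Wk"], p["gk"])
    v = simple_norm(h, p["Wv"], p["gv"])
    scores = q @ k.T / np.sqrt(d) + np.triu(np.full((T, T), -np.inf), k=1)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # causal softmax
    h = h + simple_norm(A @ v, p["Wo"], p["go"])       # residual: h~
    # Feed-forward sublayer with SN after both projections.
    m = np.maximum(simple_norm(h, p["W1"], p["g1"]), 0.0)  # ReLU as Activation
    return h + simple_norm(m, p["W2"], p["g2"])        # residual: h_l
```

Note that the only normalization anywhere in the block sits directly after each weight matrix, mirroring the equations above.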

2. SimpleNorm Operator: Mathematical Formulation

SimpleNorm directly normalizes every linear projection. For an input $x \in \mathbb{R}^m$, weight $W \in \mathbb{R}^{d \times m}$, and learnable scaling $\gamma \in \mathbb{R}^d$:

$$\mathrm{SN}(x;W,\gamma) = \gamma \odot \left( \sqrt{d}\,\frac{Wx}{\|Wx\|_2} \right)$$

  • $\gamma$ is a per-dimension scaling parameter (as in RMSNorm).
  • The $\sqrt{d}$ factor ensures an activation norm $\approx \|\gamma\|_2$.
  • Equivalently:
    • $z = Wx$
    • $s = \|z\|_2$
    • $u = z/s$, with $\|u\|_2 = 1$
    • $D = \operatorname{diag}(\gamma)$
    • $\mathrm{SN}(x;W,\gamma) = \sqrt{d}\,D\,u$

This ensures that all linear projections have outputs with a fixed norm controlled only by $\gamma$, providing invariant scaling regardless of $\|W\|_2$.
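The operator is small enough to state directly in code. A minimal NumPy sketch (`simple_norm` is a hypothetical helper name, not the paper's implementation):

```python
import numpy as np

def simple_norm(x, W, gamma):
    """SN(x; W, gamma) = gamma * sqrt(d) * Wx / ||Wx||_2."""
    z = W @ x                     # z = Wx
    u = z / np.linalg.norm(z)     # unit direction, ||u||_2 = 1
    return np.sqrt(len(z)) * gamma * u
```

With $\gamma = \mathbf{1}$ the output norm is exactly $\sqrt{d}$, and rescaling $W$ leaves the output unchanged, which is the scale invariance claimed above.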

3. Comparison to LayerNorm and QKNorm

Standard PreNorm GPT applies LayerNorm to the block input, then performs the attention and MLP projections. QKNorm adds LayerNorms on $q$ and $k$ prior to the dot product, but the outputs of the remaining projections (e.g. $v$) are still free to scale with $W$.

In contrast, SimpleGPT applies SimpleNorm directly after every linear mapping—attention projections ($W_q$, $W_k$, $W_v$, $W_o$) and feed-forward projections ($W_1$, $W_2$):

  • Guarantees that every normalized projection output has $\ell_2$ norm $\approx \sqrt{d}$.
  • Removes all block-level LayerNorms.
  • Prevents scale drift, explosion, or collapse in intermediate activations.
| Projection | PreNorm | QKNorm | SimpleGPT |
|---|---|---|---|
| Attention $q,k,v$ | LN + $W$ | $W$ + LN ($q,k$) | SN (all) |
| FFN | LN + $W_1$/$W_2$ | LN + $W_1$/$W_2$ | SN ($W_1$ and $W_2$) |
| Output | $W_o$ + LN | $W_o$ + LN | SN ($W_o$) |

The activation norm control at each projection step is unique to SimpleGPT.
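A quick numeric sketch of the contrast (NumPy; `post_norm` is a stand-in for post-projection normalization, not code from the paper):

```python
import numpy as np

def post_norm(z):
    # Stand-in for normalizing a projection output to norm sqrt(d),
    # as SimpleNorm does (gamma omitted for brevity).
    return np.sqrt(len(z)) * z / np.linalg.norm(z)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(32)

# Under QKNorm, q and k are normalized, but v = W_v x still scales
# linearly with the weights:
ratio = np.linalg.norm((10 * W) @ x) / np.linalg.norm(W @ x)  # ten-fold growth

# Under SimpleNorm, every projection output is renormalized, so the
# same ten-fold weight growth is absorbed entirely:
sn_a = post_norm(W @ x)
sn_b = post_norm((10 * W) @ x)  # identical to sn_a
```

This is the per-projection norm control the table summarizes: only SimpleGPT applies it at every row.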

4. Geometric Analysis: Hessian Spectral Norm and Stability

Optimization stability is governed by the largest eigenvalue of the Hessian $H_{xx} = \nabla_x^2 \ell(x)$, which bounds the maximum stable learning rate: $\eta_{\max} \leq 2/\beta$, where $\beta = \sup_x \|H_{xx}\|_2$.

SimpleNorm’s forward and backward pass yields:

$$H_{xx} = J^\top H_{yy} J + C$$

where $J = \frac{\sqrt{d}}{\|Wx\|_2}\, D\, (I - uu^\top)\, W$ and $C$ is a secondary curvature term satisfying $\|C\|_2 \leq \frac{3\kappa^2}{\sqrt{d}}\,\|g_y\|_2$, with $g_y$ the gradient with respect to the layer output. In high dimension, $\|C\|_2$ is negligible relative to the leading term $J^\top H_{yy} J$.

  • The spectral norm of the SimpleNorm Hessian is independent of $\|W\|_2$, making it scale-invariant.
  • For an ordinary linear layer, the Hessian grows as $\|W\|_2^2$ during training, potentially producing large curvature and instability at high learning rates.

The result is that

$$\sup_x \|H_{xx}^{\rm sn}(x)\|_2 \ll \sup_x \|H_{xx}^{\rm lin}(x)\|_2$$

A plausible implication is improved training stability in large models as weight norms increase.
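The scale invariance of the leading term can be checked numerically (a NumPy sketch; `sn_jacobian` is a hypothetical helper implementing the $J$ defined above, with $\gamma = \mathbf{1}$):

```python
import numpy as np

def sn_jacobian(x, W, gamma):
    # J = (sqrt(d)/||Wx||_2) * D (I - u u^T) W, the first-order factor
    # from the Hessian decomposition above.
    z = W @ x
    s = np.linalg.norm(z)
    u = z / s
    d = W.shape[0]
    return (np.sqrt(d) / s) * np.diag(gamma) @ (np.eye(d) - np.outer(u, u)) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(32)
g = np.ones(64)

# Scaling W by 10 grows a plain linear layer's Jacobian (which is W
# itself) ten-fold ...
lin_growth = np.linalg.norm(10 * W, 2) / np.linalg.norm(W, 2)

# ... but leaves the SimpleNorm Jacobian exactly unchanged: the 1/||Wx||
# prefactor cancels the weight scale.
J_a = sn_jacobian(x, W, g)
J_b = sn_jacobian(x, 10 * W, g)
```

The cancellation is exact, not approximate: scaling $W$ by $c$ scales both $W$ and $\|Wx\|_2$ by $c$, leaving $u$ and hence $J$ fixed.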

5. Stable High Learning Rate Regimes

The theoretical reduction in Hessian spectral norms enables SimpleGPT to tolerate larger stable learning rates, empirically observed as:

$$\eta_{\rm sn} \in [3\times,\ 10\times]\;\eta_{\rm standard}$$

This matches the predicted reduction in curvature by $\sim\|W\|_2^2$ for large-scale models, allowing up to an order-of-magnitude gain in learning rate over PreNorm and QKNorm configurations without loss collapse or divergence.
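Read as a back-of-envelope consequence of the $\eta_{\max} \le 2/\beta$ bound from Section 4 (an illustrative reading, not a result stated in the source):

```latex
% For a plain linear layer, curvature grows with the weight norm,
% so a 3x growth in ||W||_2 shrinks the stable learning rate ~9x;
% the SimpleNorm bound carries no such dependence.
\eta_{\max}^{\mathrm{lin}} \;\le\; \frac{2}{\beta^{\mathrm{lin}}}
  \;\propto\; \frac{1}{\|W\|_2^2},
\qquad
\eta_{\max}^{\mathrm{sn}} \;\le\; \frac{2}{\beta^{\mathrm{sn}}},
\quad \beta^{\mathrm{sn}} \text{ independent of } \|W\|_2 .
```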

6. Empirical Configurations and Training Protocol

Empirical validations span four model scales, with the following configurations:

| Model | Origin | Layers $L$ | $d_\mathrm{model}$ | $d_\mathrm{ffn}$ | Heads | Seq × Batch | Steps | LR (SimpleGPT) | Weight Decay |
|---|---|---|---|---|---|---|---|---|---|
| 1B | Llama 2 | 18 | 2,048 | 5,632 | 16 | 512×256 | 200K | $2\times10^{-3}$ (10×) | 0.05 |
| 1.4B | nanoGPT | 48 | 1,536 | 6,144 | 24 | 1024×512 | 100K | $6\times10^{-4}$ (3×) | 0.10 |
| 7B | Llama 2 | 32 | 4,096 | 11,008 | 32 | 2048×192 | 20K/40K/60K | $1\times10^{-3}$ (3×) | 0.05 |
| 8B | Llama 3 | 32 | 4,096 | 14,336 | 32 | 2048×192 | 20K | $3\times10^{-3}$ (3×) | 0.05 |
  • Optimizer: AdamW $(\beta_1 = 0.9,\ \beta_2 = 0.95)$
  • Schedule: cosine LR, bfloat16 precision, A800 GPU hardware
  • Datasets: C4 (Llama2/3), OpenWebText (nanoGPT 1.4B)
  • Overhead: minimal, $\sim 3\%$ additional step time
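As an illustration of the schedule above, a minimal cosine-decay implementation using the 1B row's values (the warmup length is an assumption; the source specifies only the cosine shape and the peak learning rates):

```python
import math

def cosine_lr(step, peak_lr=2e-3, total_steps=200_000, warmup=2_000):
    """Cosine learning-rate decay with linear warmup.

    peak_lr and total_steps mirror the 1B SimpleGPT configuration;
    the warmup length is a conventional choice, not from the paper.
    """
    if step < warmup:
        return peak_lr * step / warmup                    # linear warmup
    t = (step - warmup) / (total_steps - warmup)          # progress in [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))  # decay to 0
```

The schedule peaks at $2\times10^{-3}$ after warmup and decays smoothly to zero at step 200K.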

7. Ablation Studies and Performance Outcomes

Max-LR tolerance, loss improvement, and robustness to hyperparameters are reported:

  • Max-LR Tolerance (1B Llama2):
    • PreNorm diverges at $\eta = 2\times10^{-3}$
    • PreNorm+QKNorm: stable up to $2\times10^{-2}$, diverges at $2\times10^{-1}$
    • SimpleNorm: stable even at $\eta = 2\times10^{-1}$
  • Training Loss Improvements:
| Model & Steps | PreNorm | QKNorm | SimpleGPT | $\Delta$ (SimpleGPT – QKNorm) |
|---|---|---|---|---|
| 1B (200K) | 2.478 | 2.478 | 2.446 | –0.032 |
| 1.4B (100K) | 3.010 | 3.010 | 2.967 | –0.043 |
| 7B (60K) | 2.290 | 2.290 | 2.208 | –0.082 |
| 8B (20K) | 2.320 | 2.320 | 2.240 | –0.080 |
  • Learning-Rate Sweep (1B): SimpleGPT matches or outperforms QKNorm across all learning rates, with the margin increasing at higher rates.
  • Weight Decay Robustness: SimpleGPT's improvement (loss reduction $\sim 0.05$–$0.08$) is sustained as weight decay is varied from $0.1$ to $0.03$ in the 7B and 8B settings.

In summary, per-projection normalization through SimpleNorm ensures stable activation scales and Hessian conditioning, permitting learning rates $3$–$10\times$ larger and yielding reliable reductions in training loss across the 1B–8B parameter scale, with low computational overhead and resilience to hyperparameter perturbation (Chen et al., 1 Feb 2026).

References (1)
