SimpleGPT Architecture: SimpleNorm Innovation
- SimpleGPT is a GPT-style decoder-only Transformer that applies SimpleNorm after every linear projection to maintain fixed activation norms.
- SimpleNorm normalizes the output of every linear projection to an invariant scale; a second-order analysis links this activation scaling to the Hessian spectral norm, permitting substantially higher learning rates.
- Empirical results on models from 1B to 8B parameters show improved training stability and loss reduction compared to PreNorm and QKNorm approaches.
SimpleGPT is a GPT-style decoder-only Transformer architecture distinguished by the systematic application of a novel normalization operator, SimpleNorm, immediately after every linear projection. Designed to address optimization instabilities intrinsic to large-scale Transformer training, SimpleGPT leverages insights from second-order geometry to link architectural choices, activation scaling, Hessian spectral norm, and maximum stable learning rate. Its empirical instantiations demonstrate substantial improvements in training stability and final loss across multiple model scales (1B–8B), and permit learning rates up to an order of magnitude larger than conventionally used methods, including PreNorm and QKNorm (Chen et al., 1 Feb 2026).
1. Model Structure and Forward Pass
SimpleGPT inherits the classic decoder-only Transformer backbone. Its key architectural aspects are:
- Token Embeddings: a learned embedding table over the vocabulary. Positional representations include RoPE (Llama2-based 1B/7B and Llama3-based 8B models) and learned positional embeddings for the 1.4B-parameter (nanoGPT-based) model.
- Transformer Blocks: A stack of blocks, each comprising attention and feed-forward sublayers. Unlike standard PreNorm approaches, no block-level LayerNorm is applied at input. Instead, every linear mapping within each sublayer is immediately followed by SimpleNorm.
- Output Head: A single linear projection, tied or untied to the embedding matrix, producing logits for the cross-entropy loss.
A block’s computation (writing $\mathrm{SN}$ for SimpleNorm, Section 2) involves:
- Attention sublayer: $q, k, v = \mathrm{SN}(W_q x),\ \mathrm{SN}(W_k x),\ \mathrm{SN}(W_v x)$, followed by $a = \mathrm{Attn}(q, k, v)$
- Residual: $h = x + \mathrm{SN}(W_o\, a)$
- Feed-forward sublayer: $y = h + \mathrm{SN}(W_2\, \sigma(\mathrm{SN}(W_1 h)))$
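The block structure can be sketched in numpy as follows. This is a minimal single-head illustration under stated assumptions: ReLU stands in for the unspecified nonlinearity, the gain $g$ is fixed to 1, and all names are illustrative rather than the paper's reference code.

```python
import numpy as np

def sn(z, eps=1e-6):
    # SimpleNorm applied to an already-projected vector: rescale to L2 norm sqrt(d); gain g omitted (g = 1)
    return z / (np.linalg.norm(z) + eps) * np.sqrt(z.shape[-1])

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def simplegpt_block(X, Wq, Wk, Wv, Wo, W1, W2):
    """One single-head SimpleGPT-style block: SimpleNorm after every
    projection, no block-level LayerNorm (illustrative sketch)."""
    T, d = X.shape
    q = np.stack([sn(Wq @ x) for x in X])   # SN after each attention projection
    k = np.stack([sn(Wk @ x) for x in X])
    v = np.stack([sn(Wv @ x) for x in X])
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    a = softmax(scores) @ v
    h = X + np.stack([sn(Wo @ ai) for ai in a])                  # attention residual
    f = np.stack([sn(W2 @ np.maximum(sn(W1 @ hi), 0.0)) for hi in h])
    return h + f                                                 # feed-forward residual

rng = np.random.default_rng(0)
d, T = 16, 4
X = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
W1, W2 = rng.normal(size=(4 * d, d)), rng.normal(size=(d, 4 * d))
Y = simplegpt_block(X, Wq, Wk, Wv, Wo, W1, W2)
print(Y.shape)   # (4, 16)
```

Note that, unlike a PreNorm block, no normalization touches the residual stream itself; only the projected quantities are normalized.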
2. SimpleNorm Operator: Mathematical Formulation
SimpleNorm directly normalizes every linear projection. For an input $x \in \mathbb{R}^{d_{\text{in}}}$, weight $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, and learnable scaling $g \in \mathbb{R}^{d_{\text{out}}}$:

$$\mathrm{SN}(x) = g \odot \frac{Wx}{\lVert Wx \rVert_2} \sqrt{d_{\text{out}}}$$

- $g$ is a per-dimension scaling parameter (as in RMSNorm).
- The factor $\sqrt{d_{\text{out}}}$ ensures activation norm $\lVert \mathrm{SN}(x) \rVert_2 = \sqrt{d_{\text{out}}}$ when $g = \mathbf{1}$.
- Equivalently: $\mathrm{SN}(x) = g \odot \hat{z}\,\sqrt{d_{\text{out}}}$, with $\hat{z} = Wx / \lVert Wx \rVert_2$.
This ensures that all linear projections have outputs with a fixed norm controlled only by $g$, providing invariant scaling regardless of $\lVert W \rVert$.
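A minimal numpy sketch of the operator, assuming the RMSNorm-style formulation above (names are illustrative, not the paper's reference implementation):

```python
import numpy as np

def simple_norm(x, W, g=None, eps=1e-6):
    """Illustrative SimpleNorm sketch: project with W, rescale to L2 norm
    sqrt(d_out), then apply the per-dimension gain g (as in RMSNorm)."""
    z = W @ x
    d_out = z.shape[0]
    z_hat = z / (np.linalg.norm(z) + eps)   # unit direction of Wx
    g = np.ones(d_out) if g is None else g
    return g * z_hat * np.sqrt(d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=64)
W = rng.normal(size=(128, 64))

y = simple_norm(x, W)
print(float(np.linalg.norm(y)))         # ≈ sqrt(128) ≈ 11.31

# Invariance: scaling W leaves the output unchanged (up to eps).
y_scaled = simple_norm(x, 1000.0 * W)
print(bool(np.allclose(y, y_scaled)))   # True
```

The final check illustrates the scale invariance claimed above: since $Wx$ is divided by its own norm, any multiplicative growth in $W$ cancels out.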
3. Comparison to LayerNorm and QKNorm
Standard PreNorm GPT applies LayerNorm to the block input, then performs the attention and MLP projections. QKNorm adds LayerNorms on $q$ and $k$ prior to the dot product, but still allows the norms of $q$, $k$, and $v$ to scale with $\lVert W \rVert$.
In contrast, SimpleGPT applies SimpleNorm directly after every linear mapping, i.e. the attention projections ($W_q$, $W_k$, $W_v$, $W_o$) and the feed-forward projections ($W_1$, $W_2$):
- Guarantees a fixed activation norm for all such outputs.
- Removes all block-level LayerNorms.
- Prevents scale drift, explosion, or collapse in intermediate activations.
| Projection | PreNorm | QKNorm | SimpleGPT |
|---|---|---|---|
| Attention $q, k, v$ | LN on block input, then $W_{q,k,v}$ | $W_{q,k,v}$, then LN on $q, k$ only | SN after $W_q$, $W_k$, $W_v$ (and $W_o$) |
| FFN | LN on block input, then $W_1$/$W_2$ | LN on block input, then $W_1$/$W_2$ | SN after $W_1$ and $W_2$ |
| Output | final LN, then projection | final LN, then projection | SN after the projection |
The activation norm control at each projection step is unique to SimpleGPT.
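The placement differences in the table can be made concrete with a small sketch (illustrative functions; LayerNorm shown without affine parameters, SimpleNorm with $g = 1$):

```python
import numpy as np

def ln(x, eps=1e-6):
    # LayerNorm without affine parameters, for illustration
    return (x - x.mean()) / (x.std() + eps)

def sn(z, eps=1e-6):
    # SimpleNorm on a projected vector: fixed L2 norm sqrt(d), g = 1
    return z / (np.linalg.norm(z) + eps) * np.sqrt(z.shape[-1])

def qkv_prenorm(x, Wq, Wk, Wv):
    xn = ln(x)                        # one LN at the block input
    return Wq @ xn, Wk @ xn, Wv @ xn  # projection norms still scale with ||W||

def qkv_qknorm(x, Wq, Wk, Wv):
    xn = ln(x)
    return ln(Wq @ xn), ln(Wk @ xn), Wv @ xn   # extra LN on q and k only

def qkv_simplegpt(x, Wq, Wk, Wv):
    return sn(Wq @ x), sn(Wk @ x), sn(Wv @ x)  # SN after every projection, no block LN

rng = np.random.default_rng(2)
d = 32
x = rng.normal(size=d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

q, _, _ = qkv_simplegpt(x, Wq, Wk, Wv)
print(float(np.linalg.norm(q)))       # ≈ sqrt(32): fixed regardless of Wq

q_pre, _, _ = qkv_prenorm(x, Wq, Wk, Wv)
q_pre10, _, _ = qkv_prenorm(x, 10.0 * Wq, Wk, Wv)
print(bool(np.allclose(q_pre10, 10.0 * q_pre)))   # True: PreNorm q scales with ||W||
```

The last comparison shows the failure mode the table highlights: under PreNorm (and for $v$ under QKNorm), projected activations inherit the magnitude of the weights.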
4. Geometric Analysis: Hessian Spectral Norm and Stability
Optimization stability is governed by the largest eigenvalue $\lambda_{\max}$ of the loss Hessian $H = \nabla^2 \mathcal{L}$, which bounds the maximum stable learning rate as $\eta_{\max} \approx 2 / \lambda_{\max}$.
Analyzing SimpleNorm’s forward and backward passes shows that the dominant curvature contribution is set by the normalization itself rather than by the weight magnitude, while a secondary curvature term has a norm that becomes negligible in high dimension.
- The spectral norm of the SimpleNorm Hessian is independent of $\lVert W \rVert$, making it scale-invariant.
- For an ordinary linear layer, the Hessian spectral norm grows with $\lVert W \rVert$ during training, potentially producing large curvature and instability at high learning rates.
The result is that SimpleNorm keeps the curvature seen by the optimizer bounded as the weights grow. A plausible implication is improved training stability in large models as weight norms increase.
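The scale-invariance argument can be checked numerically. The sketch below (illustrative notation, not the paper's derivation) computes the Jacobian of $y = \sqrt{d}\,Wx / \lVert Wx \rVert$ with respect to $x$ for $g = \mathbf{1}$, and verifies that it is unchanged when $W$ is scaled, whereas a plain linear layer's Jacobian is $W$ itself and grows with it:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 64
x = rng.normal(size=d_in)
W = rng.normal(size=(d_out, d_in))

def sn_jacobian(W, x):
    """Jacobian of y = sqrt(d) * Wx / ||Wx|| with respect to x (g = 1):
    J = sqrt(d)/||Wx|| * (I - z_hat z_hat^T) W, a projection scaled by 1/||Wx||."""
    z = W @ x
    z_hat = z / np.linalg.norm(z)
    P = np.eye(d_out) - np.outer(z_hat, z_hat)   # projector orthogonal to Wx
    return np.sqrt(d_out) / np.linalg.norm(z) * P @ W

J1 = sn_jacobian(W, x)
J2 = sn_jacobian(10.0 * W, x)          # scale the weights 10x
print(bool(np.allclose(J1, J2)))       # True: SimpleNorm's sensitivity ignores ||W||

# A plain linear layer's Jacobian is W, so its spectral norm grows with ||W||.
print(float(np.linalg.norm(10.0 * W, 2) / np.linalg.norm(W, 2)))   # ≈ 10.0
```

The cancellation is mechanical: scaling $W$ by $c$ multiplies both the numerator $W$ and the denominator $\lVert Wx \rVert$ by $c$, leaving the Jacobian, and hence the curvature it induces, unchanged.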
5. Stable High Learning Rate Regimes
The theoretical reduction in Hessian spectral norm enables SimpleGPT to tolerate larger stable learning rates: empirically, SimpleGPT trains stably at learning rates $3$–$10\times$ those usable with the baselines.
This matches the predicted reduction in curvature for large-scale models, allowing up to an order of magnitude higher learning rates than PreNorm and QKNorm configurations without loss collapse or divergence.
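The link between curvature and the stable learning-rate range can be demonstrated on the simplest possible objective. The following sketch (a toy quadratic, not the paper's experiment) shows gradient descent diverging exactly when $\eta$ exceeds $2/\lambda_{\max}$, and tolerating the same $\eta$ once the curvature is reduced:

```python
def gd_diverges(lam_max, eta, steps=200):
    """Gradient descent on f(w) = 0.5 * lam_max * w^2 (so f'' = lam_max).
    The iteration w <- (1 - eta * lam_max) * w diverges iff eta > 2 / lam_max."""
    w = 1.0
    for _ in range(steps):
        w -= eta * lam_max * w
    return abs(w) > 1.0

lam = 100.0                              # curvature, i.e. Hessian spectral norm
print(gd_diverges(lam, 1.9 / lam))       # False: below the 2/lambda threshold
print(gd_diverges(lam, 2.1 / lam))       # True: above it
# Reducing lam_max (what SimpleNorm does to the Hessian) widens the stable range:
print(gd_diverges(lam / 10, 2.1 / lam))  # False: 10x smaller curvature tolerates the same eta
```

This is the mechanism behind the reported learning-rate gains: a smaller $\lambda_{\max}$ moves the divergence threshold $2/\lambda_{\max}$ upward by the same factor.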
6. Empirical Configurations and Training Protocol
Empirical validations span four model scales, with the following configurations:
| Model | Origin | Layers | $d_{\text{model}}$ | $d_{\text{ffn}}$ | Heads | Seq × Batch | Steps | LR (SimpleGPT) | Weight Decay |
|---|---|---|---|---|---|---|---|---|---|
| 1B | Llama 2 | 18 | 2,048 | 5,632 | 16 | 512×256 | 200K | (10×) | 0.05 |
| 1.4B | nanoGPT | 48 | 1,536 | 6,144 | 24 | 1024×512 | 100K | (3×) | 0.10 |
| 7B | Llama 2 | 32 | 4,096 | 11,008 | 32 | 2048×192 | 20K/40K/60K | (3×) | 0.05 |
| 8B | Llama 3 | 32 | 4,096 | 14,336 | 32 | 2048×192 | 20K | (3×) | 0.05 |
- Optimizer: AdamW
- Schedule: Cosine LR, bfloat16 precision, A800 GPU hardware
- Datasets: C4 (Llama2/3), OpenWebText (nanoGPT 1.4B)
- Overhead: minimal additional step time
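The cosine schedule listed above can be sketched as follows; the warmup handling and `lr_min` default are assumptions for illustration, as the summary does not specify the paper's exact schedule parameters:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0, warmup=0):
    """Cosine learning-rate schedule sketch with optional linear warmup
    (warmup behavior is an assumption, not taken from the paper)."""
    if warmup and step < warmup:
        return lr_max * step / warmup            # linear warmup phase
    t = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0, 1000, 3e-4))      # 3e-4 at the start
print(cosine_lr(500, 1000, 3e-4))    # ≈ 1.5e-4 at the midpoint
print(cosine_lr(1000, 1000, 3e-4))   # ≈ 0 at the end
```

In practice this function would be wrapped into the optimizer loop (e.g. via a per-step LR assignment on AdamW's parameter groups).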
7. Ablation Studies and Performance Outcomes
Max-LR tolerance, loss improvement, and robustness to hyperparameters are reported:
- Max-LR Tolerance (1B Llama2):
  - PreNorm diverges at the lowest of the elevated learning rates tested
  - PreNorm+QKNorm: stable at moderately higher learning rates before diverging
  - SimpleNorm: stable even at the highest learning rates tested
- Training Loss Improvements:
| Model & Steps | PreNorm | QKNorm | SimpleGPT | Δ (SimpleGPT – QKNorm) |
|---|---|---|---|---|
| 1B (200K) | 2.478 | 2.478 | 2.446 | –0.032 |
| 1.4B (100K) | 3.010 | 3.010 | 2.967 | –0.043 |
| 7B (60K) | 2.290 | 2.290 | 2.208 | –0.082 |
| 8B (20K) | 2.320 | 2.320 | 2.240 | –0.080 |
- Learning-Rate Sweep (1B): SimpleGPT matches or outperforms QKNorm across all learning rates, with the margin increasing at higher rates.
- Weight Decay Robustness: SimpleGPT’s improvement (loss reduction of about $0.08$) is sustained as weight decay is varied from $0.1$ to $0.03$ in the 7B and 8B settings.
In summary, per-projection normalization through SimpleNorm ensures stable activation scales and Hessian conditioning, permitting learning rates $3$–$10\times$ larger and yielding reliable reductions in training loss across the $1$B–$8$B parameter scale, with low computational overhead and resilience to hyperparameter perturbation (Chen et al., 1 Feb 2026).