Convergence in Small Language Models
- Small language models are autoregressive or masked models with fewer parameters that encounter convergence issues due to limited capacity and high gradient noise.
- Empirical findings reveal a four-phase convergence behavior, where rapid initial learning is followed by divergence affecting token-level stability and consistency.
- Interventions such as spectral regularization, architectural adjustments, and tailored training regimes are proposed to improve convergence and robustness in these models.
Small LLMs (SLMs)—defined conventionally as autoregressive or masked LLMs with substantially fewer parameters relative to state-of-the-art LLMs—play an essential practical role due to their efficiency and reduced inference costs. However, they exhibit persistent convergence challenges during both supervised and unsupervised training, stemming from fundamental limitations in capacity, optimization dynamics, and the interaction between data and parameterization. These challenges manifest as slower, less stable, and less reliable acquisition of target distributions compared to larger models, with implications for performance, stability across random initializations, and robustness in downstream tasks.
1. Definitions and Quantitative Metrics for Convergence
Convergence in LLMs is rigorously quantified by the expected per-token Kullback–Leibler (KL) divergence between the output distributions of instances trained from different random seeds. For a context $c$ and model parameters $\theta_1$ and $\theta_2$ obtained from two seeds, convergence is defined as

$$\mathrm{Conv}(c) = -D_{\mathrm{KL}}\big(p_{\theta_1}(\cdot \mid c)\,\|\,p_{\theta_2}(\cdot \mid c)\big),$$

where

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{t \in V} p(t)\,\log\frac{p(t)}{q(t)}$$

and $V$ is the token set. The overall convergence is the average over a set $C$ of held-out evaluation contexts:

$$\mathrm{Conv} = \frac{1}{|C|}\sum_{c \in C}\mathrm{Conv}(c).$$

Alternative granularities involve Centered Kernel Alignment (CKA) between layerwise activations at various checkpoints. The effective rank (ER) and proportional effective rank (PER) measure the spectral entropy of weight matrices, linking matrix expressivity to convergence (Fehlauer et al., 30 Sep 2025, Martinez et al., 2024).
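As an illustration, the seed-to-seed metric can be computed directly from two models' next-token distributions. This is a minimal NumPy sketch, assuming the convention that convergence is the negative mean per-token KL, so higher values (closer to zero) mean closer agreement between seeds:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits between two token distributions over the same vocabulary."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log2(p / q)))

def convergence(dists_a, dists_b):
    """Negative mean per-token KL between two seeds' next-token distributions
    over a shared set of held-out contexts. Zero means perfect agreement;
    more negative means the seeds diverge."""
    kls = [kl_divergence(p, q) for p, q in zip(dists_a, dists_b)]
    return -float(np.mean(kls))

# Two seeds emitting identical distributions are maximally converged (Conv = 0);
# any disagreement pushes Conv below zero.
uniform = np.full(8, 1 / 8)
peaked = np.array([0.9] + [0.1 / 7] * 7)
```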
2. Four-Phase Convergence Behavior and Model-Size Dependence
Across transformer-based architectures and tasks, a repeatable four-phase convergence-divergence trajectory is observed:
- Uniform Phase (initialization): Models emit near-uniform token distributions, with flat $p_\theta(\cdot \mid c)$ and low $\mathrm{Conv}$.
- Sharp-Convergence Phase: $\mathrm{Conv}$ rises rapidly as the model learns marginal (unigram) token frequencies.
- Sharp-Divergence Phase: $\mathrm{Conv}$ drops abruptly as context-dependent reasoning emerges, with models diverging by seed despite globally decreasing cross-entropy.
- Slow-Reconvergence Phase: Partial late-stage recovery in $\mathrm{Conv}$, coincident with induction-head formation and improved in-context learning (Fehlauer et al., 30 Sep 2025).
Empirical studies using the Pythia suite (14M–410M parameters) demonstrate that only models exceeding a threshold (between 70M and 160M) exhibit significant slow-reconvergence; smaller models plateau at or near initial convergence levels after divergence, never fully synchronizing their learned distributions. Downstream and masked-LM tasks reproduce this pattern, confirming its robustness across objectives and architectures (Fehlauer et al., 30 Sep 2025).
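Given a measured convergence curve (one $\mathrm{Conv}$ value per checkpoint), the four phase boundaries can be located with a simple peak/trough segmentation. This is a heuristic sketch, not the cited work's procedure:

```python
import numpy as np

def segment_phases(conv, flat_tol=1e-3):
    """Heuristically split a convergence curve into the four phases:
    uniform, sharp-convergence, sharp-divergence, slow-reconvergence.
    Returns the three boundary indices between them."""
    conv = np.asarray(conv, dtype=float)
    # Uniform phase ends when the curve first moves appreciably.
    deltas = np.abs(np.diff(conv))
    moving = np.flatnonzero(deltas > flat_tol)
    t_uniform = int(moving[0]) if moving.size else len(conv) - 1
    # Sharp convergence ends at the global peak after that point.
    t_peak = t_uniform + int(np.argmax(conv[t_uniform:]))
    # Sharp divergence ends at the minimum after the peak;
    # whatever follows is (slow) reconvergence, if any.
    t_trough = t_peak + int(np.argmin(conv[t_peak:]))
    return t_uniform, t_peak, t_trough
```

For a curve like `[0.0, 0.0, 0.5, 1.0, 0.4, 0.2, 0.3, 0.4]` this yields boundaries after checkpoint 1 (end of uniform phase), 3 (peak), and 5 (trough), with the tail counted as reconvergence; small models that plateau after divergence simply show no rise after the trough.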
3. Architectural and Theoretical Roots of Convergence Impediments
Three principal mechanisms underlie the struggle of small LMs to reconverge:
- Insufficient Representational Capacity: Smaller parameter budgets impede learning complex, context-sensitive mappings consistently across seeds.
- Delayed/Weak Induction-Head Emergence: Induction heads, critical for in-context learning and stabilizing outputs, arise slowly or remain underdeveloped in small models.
- Optimization Regimes and Gradient Noise: Learning rate schedules affect all sizes, but smaller models are particularly susceptible to entering regimes with noisy gradients, impeding coalescence in weight space.
Additionally, analyses examining the effective rank of weight matrices demonstrate that layers in larger models stabilize rapidly (CKA ≈ 0.7–0.8 early in training), while those in smaller models remain unstable much longer, with greater layerwise variability and “flipping” before settling (Martinez et al., 2024). Key bottlenecks are observed in late-stage MLP blocks and attention projections, where the PER of both weights and gradients is suppressed, funneling learning into a diminished subspace.
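The ER and PER diagnostics can be computed from a weight matrix's singular values. The sketch below assumes the standard Roy–Vetterli effective-rank definition (exponential of the spectral entropy) and, for PER, normalizes by the maximum possible rank:

```python
import numpy as np

def effective_rank(W):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution (Roy & Vetterli's definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                     # drop exact zeros to avoid 0 * log 0
    return float(np.exp(-np.sum(p * np.log(p))))

def proportional_effective_rank(W):
    """PER: effective rank as a fraction of the maximum possible rank,
    so values near 1 mean the full spectrum is in use."""
    return effective_rank(W) / min(W.shape)

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))            # well-spread spectrum, PER near 1
low = np.outer(rng.standard_normal(64),
               rng.standard_normal(64))         # rank-1 matrix, PER near 1/64
```

A suppressed PER in a layer's weights or gradients, as described above, shows up here as a value far below 1 despite the matrix being nominally full-rank.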
4. Linguistic and Token-Specific Factors in Divergence and Reconvergence
Convergence is highly uneven across linguistic categories:
- Token Frequency: Convergence is achieved and maintained more reliably for high-frequency tokens (top 10%), which recover substantially from divergence, while rare tokens degrade below initial levels, indicating a lack of stability.
- Part-of-Speech (PoS): Function words exhibit robust reconvergence, while content words (nouns, adjectives, verbs) remain unstable, with proper nouns and gerunds especially volatile.
- Surprisal Bins: Tokens with low final surprisal benefit from strong reconvergence, whereas mid- and high-surprisal groups show moderate, clustered convergence profiles (Fehlauer et al., 30 Sep 2025).
Practical implication: Stable small-LM training requires mechanisms to ensure consistent learning of rare and content-bearing tokens across random seeds.
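The frequency-stratified analysis can be sketched as a decile breakdown of per-token seed-to-seed KL. This is illustrative only; the binning details in the cited work may differ:

```python
import numpy as np

def convergence_by_frequency(token_kl, token_freq, n_bins=10):
    """Group per-token KL divergences (between two seeds) into frequency
    deciles and report the mean divergence per bin, so rare-token
    instability shows up directly.

    token_kl:   array of per-token KL values in bits
    token_freq: array of corpus frequencies, one per token
    Returns a list of n_bins means, ordered rare -> frequent.
    """
    token_kl = np.asarray(token_kl, dtype=float)
    order = np.argsort(token_freq)          # rare tokens first
    bins = np.array_split(order, n_bins)
    return [float(token_kl[b].mean()) for b in bins]
```

Under the reported pattern, the last bin (top-frequency decile) carries the lowest mean divergence and the first bin the highest.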
5. Convergence in Reasoning and Sequence-Generation Tasks
Small LMs face pronounced difficulties in converging when tasked with learning long chain-of-thought (CoT) reasoning, chiefly due to capacity constraints.
- Sequence Length Effects: Training on lengthy, redundant CoT sequences increases gradient variance and slows per-token convergence, with SLMs allocating resources to memorizing extraneous “overthinking” rather than key reasoning steps.
- Distillation and Data Pruning: Efficient CoT distillation via binary cutting (logarithmic search), validated on-policy by the SLM itself, reduces average token counts by 39–49% while maintaining nearly all downstream accuracy (drops of 1–4%).
- Gradient Efficiency: Pruned, model-aligned CoT training yields more stable gradients and rapid convergence to accuracy plateaus, reducing wall-clock time and inter-epoch variance (Wang et al., 24 May 2025).
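The binary-cutting idea, finding in logarithmically many validation calls the shortest CoT prefix from which the student still answers correctly, can be sketched as follows. This assumes, for illustration, that validity is monotone in prefix length; `validate` is a hypothetical on-policy check, not an API from the cited work:

```python
def binary_cut(cot_steps, validate):
    """Logarithmic search for the shortest CoT prefix that still
    passes `validate` (e.g., the student SLM reaches the right answer
    when given this prefix). Returns the pruned chain."""
    lo, hi = 1, len(cot_steps)          # candidate prefix lengths
    while lo < hi:
        mid = (lo + hi) // 2
        if validate(cot_steps[:mid]):
            hi = mid                    # a shorter prefix still works
        else:
            lo = mid + 1                # need more steps
    return cot_steps[:lo]
```

For an 8-step chain this needs only about three validation calls instead of eight, which is where the training-cost savings come from.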
These findings further reinforce that prompt engineering and data curation strategies interact with fundamental convergence limitations in SLMs.
6. Interventions and Prospects for Improved Convergence
Several architectural and algorithmic interventions are validated or proposed:
- Spectral Regularization: Augmenting the loss with a regularizer maximizing the spectral entropy of weight matrices,
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} - \lambda \sum_{\ell} H\big(\sigma(W_\ell)\big), \qquad H(\sigma) = -\sum_i \frac{\sigma_i}{\sum_j \sigma_j} \log \frac{\sigma_i}{\sum_j \sigma_j},$$
directly raises the effective rank, potentially accelerating convergence trajectories.
- Architectural Scaling: Increasing bottleneck dimensions in attention and late-stage MLP layers can preemptively boost PER, expanding the subspace over which the model can update the residual stream.
- Gradient-Noise Modulation: Injecting stochasticity to foster exploration of broader modes in weight space.
- Token- and Seed-Specific Stabilization: Strategies include seed ensembles and targeted regularization for rare or content-bearing tokens, addressing their disproportionate instability (Martinez et al., 2024, Fehlauer et al., 30 Sep 2025).
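The spectral-entropy penalty can be sketched numerically. NumPy computes the quantity only; an actual training implementation would use an autodiff framework so gradients flow through the SVD:

```python
import numpy as np

def spectral_entropy_penalty(W, eps=1e-12):
    """Negative spectral entropy of W's normalized singular values.
    Adding lam * penalty to the loss pushes the optimizer to raise
    the entropy, and hence the effective rank, of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    # sum p log p = -H(p); minimizing it maximizes the entropy
    return float(np.sum(p * np.log(p + eps)))
```

A rank-1 matrix scores near 0 (no spectral entropy to reward), while an identity-like matrix with a flat spectrum scores about $-\log(\min(m, n))$, the most favorable value.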
These approaches are aligned with empirical evidence that convergence speed, stability, and final state quality in SLMs are bottlenecked not solely by raw size but also by the spectral support in their parameter matrices, layerwise optimization dynamics, and the linguistic distribution of training data.
7. Implications for Future Research and Deployment
Convergence challenges in small LLMs highlight non-trivial upper bounds on stability, reproducibility, and performance under parameter and data constraints. Models below ∼100M parameters are particularly vulnerable to failure to synchronize learned distributions across random seeds, with persistent linguistic and topological sources of instability. A plausible implication is the necessity for both principled scaling and targeted regularization in protocol design for robust, stable SLM deployment—especially where model determinism, interpretability, or fairness across linguistic classes is critical. Advances in spectral regularization, data curation, and task-specific pretraining regimens remain promising avenues for mitigating structural convergence bottlenecks in resource-constrained transformer architectures (Fehlauer et al., 30 Sep 2025, Martinez et al., 2024, Wang et al., 24 May 2025).