Convergence in Small Language Models
- Small language models are autoregressive or masked models with fewer parameters that encounter convergence issues due to limited capacity and high gradient noise.
- Empirical findings reveal a four-phase convergence behavior, where rapid initial learning is followed by divergence affecting token-level stability and consistency.
- Interventions such as spectral regularization, architectural adjustments, and tailored training regimes are proposed to improve convergence and robustness in these models.
Small LLMs (SLMs)—defined conventionally as autoregressive or masked LLMs with substantially fewer parameters relative to state-of-the-art LLMs—play an essential practical role due to their efficiency and reduced inference costs. However, they exhibit persistent convergence challenges during both supervised and unsupervised training, stemming from fundamental limitations in capacity, optimization dynamics, and the interaction between data and parameterization. These challenges manifest as slower, less stable, and less reliable acquisition of target distributions compared to larger models, with implications for performance, stability across random initializations, and robustness in downstream tasks.
1. Definitions and Quantitative Metrics for Convergence
Convergence in LLMs is rigorously quantified by the expected per-token Kullback–Leibler (KL) divergence between the output distributions of instances trained from different random seeds. For a context $c$ and model parameters $\theta_1$ and $\theta_2$ obtained from two seeds, convergence is defined as

$$\mathrm{Conv}(c) = -D_{\mathrm{KL}}\big(p_{\theta_1}(\cdot \mid c)\,\|\,p_{\theta_2}(\cdot \mid c)\big),$$

where

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{t \in V} p(t)\,\log\frac{p(t)}{q(t)}$$

and $V$ is the token set. The overall convergence is the average over a set $C$ of held-out evaluation contexts:

$$\mathrm{Conv} = \frac{1}{|C|}\sum_{c \in C}\mathrm{Conv}(c).$$

Alternative granularities involve Centered Kernel Alignment (CKA) between layerwise activations at various checkpoints. The effective rank (ER) and proportional effective rank (PER) measure the spectral entropy of weight matrices, linking matrix expressivity to convergence (Fehlauer et al., 30 Sep 2025, Martinez et al., 2024).
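As an illustration, the seed-to-seed metric can be computed directly from two models' next-token distributions. This is a minimal NumPy sketch, assuming the convention that convergence is the negative mean per-token KL, so higher values (closer to zero) mean closer agreement between seeds:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits between two token distributions over the same vocabulary."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log2(p / q)))

def convergence(dists_a, dists_b):
    """Negative mean per-token KL between two seeds' next-token distributions
    over a shared set of held-out contexts. Zero means perfect agreement;
    more negative means the seeds diverge."""
    kls = [kl_divergence(p, q) for p, q in zip(dists_a, dists_b)]
    return -float(np.mean(kls))

# Two seeds emitting identical distributions are maximally converged (Conv = 0);
# any disagreement pushes Conv below zero.
uniform = np.full(8, 1 / 8)
peaked = np.array([0.9] + [0.1 / 7] * 7)
```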
2. Four-Phase Convergence Behavior and Model-Size Dependence
Across transformer-based architectures and tasks, a repeatable four-phase convergence-divergence trajectory is observed:
- Uniform Phase (initialization): Models emit near-uniform token distributions, with flat $p_\theta(\cdot \mid c)$ and low $\mathrm{Conv}$.
- Sharp-Convergence Phase: $\mathrm{Conv}$ rises rapidly as the model learns marginal (unigram) token frequencies.
- Sharp-Divergence Phase: $\mathrm{Conv}$ drops abruptly as context-dependent reasoning emerges, with models diverging by seed despite globally decreasing cross-entropy.
- Slow-Reconvergence Phase: Partial late-stage recovery in $\mathrm{Conv}$, coincident with induction-head formation and improved in-context learning (Fehlauer et al., 30 Sep 2025).
Empirical studies using the Pythia suite (14M–410M parameters) demonstrate that only models exceeding a threshold (between 70M and 160M) exhibit significant slow-reconvergence; smaller models plateau at or near initial convergence levels after divergence, never fully synchronizing their learned distributions. Downstream and masked-LM tasks reproduce this pattern, confirming its robustness across objectives and architectures (Fehlauer et al., 30 Sep 2025).
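Given a measured convergence curve (one $\mathrm{Conv}$ value per checkpoint), the four phase boundaries can be located with a simple peak/trough segmentation. This is a heuristic sketch, not the cited work's procedure:

```python
import numpy as np

def segment_phases(conv, flat_tol=1e-3):
    """Heuristically split a convergence curve into the four phases:
    uniform, sharp-convergence, sharp-divergence, slow-reconvergence.
    Returns the three boundary indices between them."""
    conv = np.asarray(conv, dtype=float)
    # Uniform phase ends when the curve first moves appreciably.
    deltas = np.abs(np.diff(conv))
    moving = np.flatnonzero(deltas > flat_tol)
    t_uniform = int(moving[0]) if moving.size else len(conv) - 1
    # Sharp convergence ends at the global peak after that point.
    t_peak = t_uniform + int(np.argmax(conv[t_uniform:]))
    # Sharp divergence ends at the minimum after the peak;
    # whatever follows is (slow) reconvergence, if any.
    t_trough = t_peak + int(np.argmin(conv[t_peak:]))
    return t_uniform, t_peak, t_trough
```

For a curve like `[0.0, 0.0, 0.5, 1.0, 0.4, 0.2, 0.3, 0.4]` this yields boundaries after checkpoint 1 (end of uniform phase), 3 (peak), and 5 (trough), with the tail counted as reconvergence; small models that plateau after divergence simply show no rise after the trough.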
3. Architectural and Theoretical Roots of Convergence Impediments
Three principal mechanisms underlie the struggle of small LMs to reconverge:
- Insufficient Representational Capacity: Smaller parameter budgets impede learning complex, context-sensitive mappings consistently across seeds.
- Delayed/Weak Induction-Head Emergence: Induction heads, critical for in-context learning and stabilizing outputs, arise slowly or remain underdeveloped in small models.
- Optimization Regimes and Gradient Noise: Learning rate schedules affect all sizes, but smaller models are particularly susceptible to entering regimes with noisy gradients, impeding coalescence in weight space.
Additionally, analyses examining the effective rank of weight matrices demonstrate that layers in larger models stabilize rapidly (CKA ≈ 0.7–0.8 early in training), while those in smaller models remain unstable much longer, with greater layerwise variability and “flipping” before settling (Martinez et al., 2024). Key bottlenecks are observed in late-stage MLP blocks and attention projections, where the PER of both weights and gradients is suppressed, funneling learning into a diminished subspace.
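The ER and PER diagnostics can be computed from a weight matrix's singular values. The sketch below assumes the standard Roy–Vetterli effective-rank definition (exponential of the spectral entropy) and, for PER, normalizes by the maximum possible rank:

```python
import numpy as np

def effective_rank(W):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution (Roy & Vetterli's definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                     # drop exact zeros to avoid 0 * log 0
    return float(np.exp(-np.sum(p * np.log(p))))

def proportional_effective_rank(W):
    """PER: effective rank as a fraction of the maximum possible rank,
    so values near 1 mean the full spectrum is in use."""
    return effective_rank(W) / min(W.shape)

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))            # well-spread spectrum, PER near 1
low = np.outer(rng.standard_normal(64),
               rng.standard_normal(64))         # rank-1 matrix, PER near 1/64
```

A suppressed PER in a layer's weights or gradients, as described above, shows up here as a value far below 1 despite the matrix being nominally full-rank.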
4. Linguistic and Token-Specific Factors in Divergence and Reconvergence
Convergence is highly uneven across linguistic categories:
- Token Frequency: Convergence is achieved and maintained more reliably for high-frequency tokens (top 10%), which recover substantially from divergence, while rare tokens degrade below initial levels, indicating a lack of stability.
- Part-of-Speech (PoS): Function words exhibit robust reconvergence, while content words (nouns, adjectives, verbs) remain unstable, with proper nouns and gerunds especially volatile.
- Surprisal Bins: Tokens with low final surprisal benefit from strong reconvergence, whereas mid- and high-surprisal groups show moderate, clustered convergence profiles (Fehlauer et al., 30 Sep 2025).
Practical implication: Stable small-LM training requires mechanisms to ensure consistent learning of rare and content-bearing tokens across random seeds.
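The frequency-stratified analysis can be sketched as a decile breakdown of per-token seed-to-seed KL. This is illustrative only; the binning details in the cited work may differ:

```python
import numpy as np

def convergence_by_frequency(token_kl, token_freq, n_bins=10):
    """Group per-token KL divergences (between two seeds) into frequency
    deciles and report the mean divergence per bin, so rare-token
    instability shows up directly.

    token_kl:   array of per-token KL values in bits
    token_freq: array of corpus frequencies, one per token
    Returns a list of n_bins means, ordered rare -> frequent.
    """
    token_kl = np.asarray(token_kl, dtype=float)
    order = np.argsort(token_freq)          # rare tokens first
    bins = np.array_split(order, n_bins)
    return [float(token_kl[b].mean()) for b in bins]
```

Under the reported pattern, the last bin (top-frequency decile) carries the lowest mean divergence and the first bin the highest.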
5. Convergence in Reasoning and Sequence-Generation Tasks
Small LMs face pronounced difficulties in converging when tasked with learning long chain-of-thought (CoT) reasoning, chiefly due to capacity constraints.
- Sequence Length Effects: Training on lengthy, redundant CoT sequences increases gradient variance and slows per-token convergence, with SLMs allocating resources to memorizing extraneous “overthinking” rather than key reasoning steps.
- Distillation and Data Pruning: Efficient CoT distillation via binary cutting (logarithmic search), validated on-policy by the SLM itself, reduces average token counts by 39–49% while maintaining nearly all downstream accuracy (drops of 1–4%).
- Gradient Efficiency: Pruned, model-aligned CoT training yields more stable gradients and rapid convergence to accuracy plateaus, reducing wall-clock time and inter-epoch variance (Wang et al., 24 May 2025).
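The binary-cutting idea, finding in logarithmically many validation calls the shortest CoT prefix from which the student still answers correctly, can be sketched as follows. This assumes, for illustration, that validity is monotone in prefix length; `validate` is a hypothetical on-policy check, not an API from the cited work:

```python
def binary_cut(cot_steps, validate):
    """Logarithmic search for the shortest CoT prefix that still
    passes `validate` (e.g., the student SLM reaches the right answer
    when given this prefix). Returns the pruned chain."""
    lo, hi = 1, len(cot_steps)          # candidate prefix lengths
    while lo < hi:
        mid = (lo + hi) // 2
        if validate(cot_steps[:mid]):
            hi = mid                    # a shorter prefix still works
        else:
            lo = mid + 1                # need more steps
    return cot_steps[:lo]
```

For an 8-step chain this needs only about three validation calls instead of eight, which is where the training-cost savings come from.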
These findings further reinforce that prompt engineering and data curation strategies interact with fundamental convergence limitations in SLMs.
6. Interventions and Prospects for Improved Convergence
Several architectural and algorithmic interventions are validated or proposed:
- Spectral Regularization: Augmenting the loss with a regularizer maximizing the spectral entropy of weight matrices,
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} - \lambda \sum_{\ell} H\big(\sigma(W_\ell)\big), \qquad H(\sigma) = -\sum_i \frac{\sigma_i}{\sum_j \sigma_j} \log \frac{\sigma_i}{\sum_j \sigma_j},$$
directly raises the effective rank, potentially accelerating convergence trajectories.
- Architectural Scaling: Increasing bottleneck dimensions in attention and late-stage MLP layers can preemptively boost PER, expanding the subspace over which the model can update the residual stream.
- Gradient-Noise Modulation: Injecting stochasticity to foster exploration of broader modes in weight space.
- Token- and Seed-Specific Stabilization: Strategies include seed ensembles and targeted regularization for rare or content-bearing tokens, addressing their disproportionate instability (Martinez et al., 2024, Fehlauer et al., 30 Sep 2025).
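The spectral-entropy penalty can be sketched numerically. NumPy computes the quantity only; an actual training implementation would use an autodiff framework so gradients flow through the SVD:

```python
import numpy as np

def spectral_entropy_penalty(W, eps=1e-12):
    """Negative spectral entropy of W's normalized singular values.
    Adding lam * penalty to the loss pushes the optimizer to raise
    the entropy, and hence the effective rank, of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    # sum p log p = -H(p); minimizing it maximizes the entropy
    return float(np.sum(p * np.log(p + eps)))
```

A rank-1 matrix scores near 0 (no spectral entropy to reward), while an identity-like matrix with a flat spectrum scores about $-\log(\min(m, n))$, the most favorable value.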
These approaches are aligned with empirical evidence that convergence speed, stability, and final state quality in SLMs are bottlenecked not solely by raw size but also by the spectral support in their parameter matrices, layerwise optimization dynamics, and the linguistic distribution of training data.
7. Implications for Future Research and Deployment
Convergence challenges in small LLMs highlight non-trivial upper bounds on stability, reproducibility, and performance under parameter and data constraints. Models below ∼100M parameters are particularly vulnerable to failure to synchronize learned distributions across random seeds, with persistent linguistic and topological sources of instability. A plausible implication is the necessity for both principled scaling and targeted regularization in protocol design for robust, stable SLM deployment—especially where model determinism, interpretability, or fairness across linguistic classes is critical. Advances in spectral regularization, data curation, and task-specific pretraining regimens remain promising avenues for mitigating structural convergence bottlenecks in resource-constrained transformer architectures (Fehlauer et al., 30 Sep 2025, Martinez et al., 2024, Wang et al., 24 May 2025).