
Training Dynamics Across Model Sizes

Updated 4 February 2026
  • Training dynamics across model sizes are characterized by scaling laws that yield universal loss curves when compute and loss are normalized.
  • Methodologies leverage curriculum alignment and teacher-student frameworks to control learning phases and emergent representational shifts.
  • Research indicates larger models memorize data faster and achieve higher generalization thresholds through optimized architecture and compute-efficient designs.

Training dynamics across model sizes refer to the evolution of loss, internal representations, optimization trajectories, emergent functional capacities, and stability during the training of neural networks, with a particular focus on how these properties are controlled, shaped, or predicted by parameter count and associated scaling laws. This topic encompasses both practical regimes (e.g., compute-constrained scaling) and theoretical frameworks (e.g., teacher-student models, power-law universality). Research has established that while scaling up size amplifies memorization speed, capacity, and generalization, critical qualitative changes often arise not from model size alone, but through the alignment between curriculum, architecture, and training schedule.

1. Scaling Laws and Universal Loss Dynamics

Empirical and theoretical studies have established that, for a given architecture and data distribution, final model performance exhibits predictable scaling with both parameter count N and training time or token count D, typically as power laws of the form

L(N, D) = E + A N^{-\alpha} + B D^{-\beta}

where E is the irreducible loss and A, B, \alpha, \beta are constants determined by architecture and data (Bordelon et al., 2024, Inbar et al., 2024).
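As a sanity check, the power law can be evaluated directly. The default constants below are approximately the Chinchilla estimates (Hoffmann et al., 2022), used here purely for illustration; the works cited in this section fit their own values per architecture and dataset.

```python
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Evaluate L(N, D) = E + A*N^-alpha + B*D^-beta.

    Default constants are illustrative (roughly the Chinchilla fit);
    each cited study estimates its own per-architecture values.
    """
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Loss decreases monotonically toward the irreducible floor E
# as either N (parameters) or D (tokens) grows.
small = chinchilla_loss(N=1e8, D=1e10)
large = chinchilla_loss(N=1e10, D=1e12)
assert large < small
assert large > 1.69  # never below the irreducible loss E
```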

In the compute-optimal regime, where training time and model size are increased in tandem according to scaling-optimal schedules, the entire loss trajectory (not just the final loss) collapses to a universal curve when both compute and loss are normalized to unity at the end of training:

\ell(x) = \frac{L(x\,t^*(p), p) - L_0}{L(t^*(p), p) - L_0}

where x = t/t^*(p) is normalized compute (Qiu et al., 2 Jul 2025). This scaling collapse indicates that loss curves for different sizes are quantitatively identical when hyperparameters are scaled optimally. With learning-rate decay, the collapse becomes so tight that seed-to-seed fluctuations within a given model exceed model-to-model deviations ("supercollapse").
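The collapse construction amounts to subtracting the irreducible loss and rescaling so that both compute and the final reducible loss equal one. A minimal sketch on synthetic power-law curves (function name and data are illustrative):

```python
import numpy as np

def collapse(loss_curve, L0):
    """Normalize a loss trajectory as in the scaling-collapse construction:
    subtract the irreducible loss L0, then rescale so the final reducible
    loss is 1 and normalized compute x = t/t*(p) runs over (0, 1]."""
    loss_curve = np.asarray(loss_curve, dtype=float)
    x = np.arange(1, len(loss_curve) + 1) / len(loss_curve)
    ell = (loss_curve - L0) / (loss_curve[-1] - L0)
    return x, ell

# Two synthetic runs sharing the same power-law exponent but different
# amplitudes land on the identical normalized curve.
t = np.arange(1, 101)
xa, ea = collapse(1.5 + 2.0 * t ** -0.3, L0=1.5)
xb, eb = collapse(1.5 + 5.0 * t ** -0.3, L0=1.5)
assert np.allclose(ea, eb)
```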

Foundationally, power-law exponents for width and time generally differ; theory predicts the scaling exponents with width and training time are

\alpha_{\rm width} = a - 1, \quad \alpha_{\rm time} = \frac{a-1}{b}

where a and b are the decay exponents of the ground-truth signal and the kernel eigenvalues, respectively. This induces asymmetric scaling: as resources grow, the compute-optimal number of training steps rises faster than the parameter count (Bordelon et al., 2024).
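The asymmetric allocation can be sketched in closed form. Balancing the two error terms N^{-α_width} ≈ t^{-α_time} under a budget C = N·t is a simplifying assumption for illustration, not the cited papers' exact derivation:

```python
def compute_optimal_allocation(a: float, b: float, C: float):
    """Split compute C = N*t by balancing the width and time error terms
    N^-aw = t^-at, with aw = a-1 and at = (a-1)/b. Solving gives
    t* = C^{aw/(aw+at)} and N* = C^{at/(aw+at)}.
    A sketch under stated assumptions, not a definitive recipe."""
    aw, at = a - 1, (a - 1) / b
    t_star = C ** (aw / (aw + at))
    n_star = C ** (at / (aw + at))
    return n_star, t_star

# With b > 1 (slowly decaying kernel spectrum), optimal steps grow
# faster than parameter count, matching the asymmetry in the text.
n_star, t_star = compute_optimal_allocation(a=2.0, b=2.0, C=1e6)
assert t_star > n_star
```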

2. Learning Phases and Representational Evolution

Across model sizes, training proceeds through distinct and largely size-invariant phases. Examination of decoder-only transformer pretraining trajectories (e.g., PolyPythias, 14M–410M parameters) via downstream accuracy, representational probes, and HMM states fit to parameter statistics reveals a progression:

  • Initial emergence of non-random semantics (phase transition at t ≈ 10^3 steps)
  • Rapid critical learning (10^3–10^4 steps)
  • Slow refinement and saturation (10^5 steps and beyond)

These phase boundaries are robust to model size, with major representational and performance shifts occurring at roughly the same normalized compute points (Wal et al., 12 Mar 2025). Principal subspace angles, macro-F1 scores, and other probe metrics display trajectories with > 0.98 correlation across seeds and scales. Outlier runs (seed-specific degenerations) are rare and typically diagnosed by reversion in training-map state sequences.
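The reported cross-seed agreement corresponds to simple trajectory correlation between probe metrics taken at matched checkpoints. A minimal sketch on synthetic trajectories (data is illustrative only):

```python
import numpy as np

def trajectory_correlation(traj_a, traj_b) -> float:
    """Pearson correlation between two probe-metric trajectories
    (e.g., macro-F1 over checkpoints) from different seeds or scales."""
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Two synthetic "seeds": the same slow logarithmic improvement plus
# small seed-specific noise correlates far above the 0.98 mark.
steps = np.arange(1, 50)
seed0 = np.log(steps)
seed1 = np.log(steps) + 0.01 * np.random.default_rng(1).standard_normal(49)
assert trajectory_correlation(seed0, seed1) > 0.98
```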

3. Memorization, Forgetting, and Overfitting Dynamics

Detailed memorization analyses across sizes (125M–13B) reveal that larger models memorize training data more quickly (fewer epochs to reach a fixed recall threshold), attain higher memorization fractions before overfitting is detectable, and show reduced forgetting of injected minibatches. For exact memorization M(f), the time to threshold satisfies T(N, \tau) \propto N^{-\alpha} with \alpha \approx 0.3–0.5 (Tirumala et al., 2022). The onset of overfitting occurs at higher memorization for larger N; forgetting baselines (asymptotic memorization of a held-out batch) likewise rise with N. Models preferentially memorize nouns and numbers first, highlighting token identity as a learned pointer.
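The time-to-threshold scaling can be sketched as follows; the reference point (T_ref, N_ref) and the mid-range exponent α = 0.4 are illustrative placeholders, not fitted values from the cited work:

```python
def time_to_threshold(N: float, T_ref: float = 1000.0,
                      N_ref: float = 125e6, alpha: float = 0.4) -> float:
    """Steps for a model of size N to reach a fixed memorization recall
    threshold, following T(N, tau) proportional to N^-alpha.
    All constants are illustrative; alpha = 0.4 is the midpoint of the
    reported 0.3-0.5 range."""
    return T_ref * (N / N_ref) ** (-alpha)

# A 13B model reaches the fixed recall threshold in far fewer steps
# than a 125M model, per the power law above.
assert time_to_threshold(13e9) < time_to_threshold(125e6)
```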

A plausible implication is that scaling up model size promotes stable long-range memory, pushing the interpolation threshold between memorization and generalization deeper into training.

4. Transferability of Training Dynamics and Curriculum Effects

Training dynamics, including per-instance difficulty, confidence, and ambiguity, are highly transferable across model sizes and even pretraining methods. Small reference models can robustly identify the ambiguous or hard-to-learn data subset for large models, enabling efficient, computation-saving fine-tuning via approaches such as FTFT ("Fine-Tuning by transFerring Training dynamics") (Du et al., 2023). This transfer is effective provided the small model achieves moderate in-domain accuracy (e.g., ≥ 85%). When applied, main models trained only on the selected ambiguous subset outperform empirical risk minimization (ERM) at up to 50% reduced compute.
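The transfer step amounts to ranking instances by the variability of the small model's per-epoch confidence and keeping the most ambiguous fraction. The sketch below follows the dataset-cartography intuition; the function name and selection fraction are illustrative, not the FTFT authors' exact implementation:

```python
import numpy as np

def select_ambiguous(confidence_per_epoch: np.ndarray, frac: float = 0.33):
    """Rank instances by variability of the small reference model's
    per-epoch correct-class confidence (shape: epochs x instances) and
    return indices of the most ambiguous fraction, for fine-tuning the
    large model. A sketch of the transfer idea only."""
    variability = confidence_per_epoch.std(axis=0)  # ambiguity proxy
    k = max(1, int(frac * confidence_per_epoch.shape[1]))
    return np.argsort(-variability)[:k]             # most ambiguous first

# Fake dynamics: 5 epochs, 100 training instances.
conf = np.random.default_rng(0).random((5, 100))
idx = select_ambiguous(conf, frac=0.33)
assert len(idx) == 33
```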

Training curriculum has a decisive impact at small model sizes and on tasks with algorithmic structure. For example, in symbolic ListOps tasks:

  • The threshold parameter count P_{\min}(C) at which 50% accuracy is achieved follows a power law, but depends not only on the task's "Kolmogorov complexity" but also on the curriculum mix.
  • Curriculum scaffolding with jointly trained easier operations (e.g., MAX, MED) reduces parameter requirements for harder ones (e.g., SUM), induces number-ordered embeddings, and robust parity (odd/even) detection (Both et al., 23 May 2025).
  • Pretraining on a joint (easier) curriculum can induce learning of a harder task in models orders of magnitude smaller than the solo-learning threshold.

This suggests algorithmic complexity, as imposed by curriculum, can dominate size in determining training dynamics and emergent abilities.
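The scaffolding idea reduces, at minimum, to mixing easier operations into each minibatch alongside the harder target. The sampler below is a toy sketch: the operation names match the ListOps examples above, but the mixing scheme and proportion are illustrative, not the cited paper's exact setup.

```python
import random

def joint_curriculum_batch(easy_ops, hard_op, p_hard, batch_size, seed=0):
    """Build a minibatch mixing easier scaffold operations (e.g., MAX,
    MED) with the harder target operation (e.g., SUM) at proportion
    p_hard. Illustrative only."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_hard:
            batch.append(hard_op)
        else:
            batch.append(rng.choice(easy_ops))
    return batch

batch = joint_curriculum_batch(["MAX", "MED"], "SUM", p_hard=0.25, batch_size=32)
assert len(batch) == 32
assert set(batch) <= {"MAX", "MED", "SUM"}
```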

5. Emergent Behaviors, Internal Representations, and Outlier Phenomena

Structural features of training, including the emergence of internal representations, outlier activations, and self-organization, display size-dependent but highly predictable trajectories. For instance, the development of massive activations (MAs; extreme scalar outliers in transformer hidden states) can be captured by a five-parameter, exponentially modulated logarithmic function whose coefficients are themselves predictable from design specifications such as attention density and width/depth ratios (Gallego-Feliciano et al., 5 Aug 2025). The asymptotic baseline of these activations is highly correlated (R^2 ≈ 0.85) with architecture, and the schedule of their emergence (latency, magnitude) is prolonged with increasing size.

Table: Massive Activation Trajectory Parameters vs. Model Size (Gallego-Feliciano et al., 5 Aug 2025)

| Model Size | Amplitude (A) | Decay (λ) | Asymptote (K) |
|------------|---------------|-----------|---------------|
| 14M        | 0.12          | 0.31      | 83            |
| 1B         | 0.92          | 0.17      | 2,350         |
| 12B        | 3.10          | 0.08      | 3,200         |

Massive activations stabilize more quickly and with lower fluctuation in larger nets. Adjusting architecture (e.g., lowering attention density) enables a priori control of MA statistics, with implications for quantization and interpretability.
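This summary does not reproduce the paper's exact five-parameter form; one plausible "exponentially modulated logarithmic" shape consistent with the table (a transient rise that saturates toward asymptote K) might look like the following. The parameterization, t0, and gamma are assumptions for illustration only:

```python
import math

def ma_trajectory(t: float, A: float, lam: float, K: float,
                  t0: float = 1.0, gamma: float = 1.0) -> float:
    """One plausible five-parameter exponentially modulated logarithmic
    form for massive-activation magnitude over training step t: a log
    growth term damped by an exponential, settling toward asymptote K.
    The cited paper's exact parameterization may differ; this only
    illustrates the qualitative rise-then-saturate shape."""
    return K + A * math.log(1 + t / t0) * math.exp(-lam * t ** gamma)

# Late in training the transient term vanishes and the value sits at K
# (here using the illustrative 1B-row parameters from the table).
assert abs(ma_trajectory(1e6, A=0.92, lam=0.17, K=2350) - 2350) < 1e-6
```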

6. Domain Specialization and Compute-Efficient Scaling

The interplay of model size and domain specialization in continued pretraining shows that, under fixed-compute constraints, larger specialized models achieve both lower perplexity and greater compute efficiency than general-domain equivalents. For example, a 14B-parameter model achieves a specialized-to-general efficiency ratio (SGER) of 4.3× relative to its general-pretrained counterpart when reaching the same domain loss (Junior et al., 3 Jan 2025). As model size grows, per-parameter efficiency gains via specialization magnify. Additionally, larger models suffer less catastrophic forgetting on general-domain data even as they integrate specialized knowledge, indicating an improved balance between memorization and knowledge retention.
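One natural reading of the efficiency ratio is compute-to-equal-loss: how much less training the specialized model needs to hit the same domain loss. The token-based definition below is an illustrative interpretation, not necessarily the cited work's exact formula:

```python
def sger(compute_general: float, compute_specialized: float) -> float:
    """Specialized-to-general efficiency ratio: compute (e.g., tokens)
    the general model needs to reach a target domain loss, divided by
    the compute the specialized model needs for the same loss.
    An illustrative reading of the metric, not the paper's exact one."""
    return compute_general / compute_specialized

# A ratio of 4.3 means the specialized model reaches the target
# domain loss with 4.3x less training compute.
assert abs(sger(4.3e9, 1e9) - 4.3) < 1e-12
```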

7. Practical Implications and Architectural Recommendations

Empirical runtime and scaling models that incorporate actual wall-clock time (dominated by memory access, not FLOPs) show that, under real hardware constraints and Chinchilla-style token–parameter tradeoffs, wider and shallower models outperform deeper, narrower ones per fixed budget (Inbar et al., 2024). Wider embedding dimensions increase throughput and training speed, while deeper architectures incur higher memory-copy overheads, reducing compute efficiency.
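The wide-versus-deep tradeoff can be made concrete with a toy runtime proxy: a compute term proportional to parameter count plus a per-layer overhead term (kernel launches, activation copies) proportional to depth. The coefficients and functional form are invented for illustration; the cited work calibrates a real runtime model against hardware measurements:

```python
def step_time_proxy(d: int, n: int, w: int,
                    c_flops: float = 1e-12, c_layer: float = 1e-4) -> float:
    """Toy wall-clock proxy per training step: compute cost ~ n*d*w
    (parameter count) plus a fixed per-layer memory/launch overhead
    ~ n. Coefficients are illustrative, not measured."""
    return c_flops * n * d * w + c_layer * n

# At a matched parameter budget (n*d*w identical), the wider/shallower
# configuration pays the same FLOP cost but less per-layer overhead.
deep = step_time_proxy(d=512,  n=48, w=2048)   # deep and narrow
wide = step_time_proxy(d=2048, n=12, w=2048)   # wide and shallow
assert wide < deep
```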

Optimization guidance for wall-clock-limited regimes:

  • Prioritize expanding embedding width d while keeping moderate depth n and MLP width w
  • Use gradient optimization on a precisely calibrated proxy loss function balancing parameter count and actual trainable tokens
  • Adjust architectural design (e.g., attention density, MLP width) to manage dynamics of outlier activations

In practice, leveraging curriculum, transferring example-level difficulty from small models, and optimizing architecture for compute efficiency are principal levers for aligning training dynamics across sizes and regimes.

