Physics of Skill Learning

Published 21 Jan 2025 in cs.LG, cs.AI, physics.data-an, and stat.ML | (2501.12391v1)

Abstract: We aim to understand physics of skill learning, i.e., how skills are learned in neural networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially, and notably, some skills kick off learning right after others complete learning, similar to the sequential fall of domino cards. To understand the Domino effect and relevant behaviors of skill learning, we take physicists' approach of abstraction and simplification. We propose three models with varying complexities -- the Geometry model, the Resource model, and the Domino model, trading between reality and simplicity. The Domino effect can be reproduced in the Geometry model, whose resource interpretation inspires the Resource model, which can be further simplified to the Domino model. These models present different levels of abstraction and simplification; each is useful to study some aspects of skill learning. The Geometry model provides interesting insights into neural scaling laws and optimizers; the Resource model sheds light on the learning dynamics of compositional tasks; the Domino model reveals the benefits of modularity. These models are not only conceptually interesting -- e.g., we show how Chinchilla scaling laws can emerge from the Geometry model, but also are useful in practice by inspiring algorithmic development -- e.g., we show how simple algorithmic changes, motivated by these toy models, can speed up the training of deep learning models.

Abstract PDF Upgrade to Chat

Summary

The paper introduces the Domino effect, demonstrating that neural networks learn skills sequentially as one task’s mastery triggers the next.
It employs three progressively simpler models—Geometry, Resource, and Domino—to explain how task frequency, optimizer dynamics, and resource allocation shape learning.
The findings offer practical insights into neural scaling laws, optimizer performance, and modular design strategies for improved learning efficiency.

This paper explores the "physics" of how skills are learned within neural networks during training, aiming for intuitive, physics-like understanding rather than purely mathematical rigor or engineering solutions. It introduces the "Domino effect" observation, where skills are often learned sequentially, with one skill's learning kicking off just as another finishes, much like falling dominoes. To understand this and related phenomena, the paper proposes three progressively simpler models.

Three Models of Skill Learning

Geometry Model:
- Concept: This model assumes that skills (tasks) are represented by specific directions (task vectors $\bm{t}_i$ ) in the high-dimensional parameter space ( $\bm{\theta}$ ) of a neural network. The skill level $s_i$ for task $i$ is the projection of the parameter change $(\bm{\theta}-\bm{\theta}_0)$ onto its task vector: $s_i = (\bm{\theta}-\bm{\theta}_0)\cdot \bm{t}_i$ .
- Assumptions:
  - Tasks contribute independently to the total loss: $\ell = \sum_{i=1}^{n_{\rm task}} p_i \mathcal{L}(s_i)$ , where $p_i$ is the frequency/importance of task $i$ and $\mathcal{L}$ is a per-task loss function (e.g., MSE $(1-s_i)^2$ or cross-entropy).
  - Task vectors are linearly represented in parameter space. In overparametrized scenarios ( $n_{\rm dim} \ge n_{\rm task}$ ), they can often be treated as orthogonal.
- Implementation: Task vectors $\bm{t}_i$ are typically drawn from a Gaussian distribution. Task frequencies $p_i$ often follow a power law ( $p_i \propto i^{-\alpha}$ ). The model is trained using standard optimizers (SGD, Adam, SignGD).
- Findings: This model reproduces the Domino effect, particularly strongly with the SignGD optimizer. SignGD's element-wise sign operation causes gradients from high-frequency (large gradient magnitude) tasks to dominate, effectively pausing learning for low-frequency tasks until the dominant ones are learned. SGD shows less interference, while Adam is intermediate.
Resource Model:
- Concept: Inspired by the Geometry model (especially with SignGD), this model abstracts away the parameter geometry and focuses on "resource" competition. It posits that the total learning capacity (related to $n_{\rm dim}$ ) is a limited resource allocated among tasks based on their current need (gradient magnitude).
- Dynamics: The rate of learning for unskill $u_i = 1-s_i$ is governed by:
  
  $\frac{du_i}{dt} = -\eta_{\rm eff}\frac{p_iu_i}{(\sum_{j=1}^{n_{\rm task}} p_ju_j)+N_0}$
  
  Here, $\eta_{\rm eff}$ is an effective learning rate (calibrated from the Geometry model, e.g., $\eta_{\rm eff} = 2\sqrt{n_{\rm dim}}\eta_{\rm geo}$ for SignGD), and $N_0$ is a phenomenological parameter representing "wasted" resources (e.g., due to optimization noise, bouncing). Smaller $N_0$ means faster learning. $N_0$ effectively captures details like learning rate, batch size, and optimizer specifics (e.g., $\beta$ values in Adam).
- Underparametrized Regime ( $n_{\rm task} > n_{\rm dim}$ ): The model is extended by introducing a correlation matrix $\bm{C}$ where $C_{ij} = \bm{t}_i \cdot \bm{t}_j$ captures task interference. The dynamics become (for regression):
  
  $\frac{du_i}{dt} = -\eta_{\rm eff}\frac{\sum_{k=1}^{n_{\rm task}} C_{ik} p_k u_k}{(\sum_{j=1}^{n_{\rm task}} p_j u_j) + N_0}$
  
  A similar form exists for classification using cross-entropy gradients. This allows modeling non-monotonic skill learning observed in underparametrized settings.
- Findings: The Resource model successfully replicates the skill dynamics (including sequential learning and non-monotonicity) observed in the Geometry model across various hyperparameter settings by tuning $N_0$ . It provides analytical insights, like conserved quantities ( $u_i^{1/p_i}$ should be constant across tasks $i$ for independent tasks) and learning time scaling.
Domino Model:
- Concept: A further simplification of the Resource model under strong assumptions:
  - No wasted resources ( $N_0=0$ ).
  - Strong task hierarchy ( $p_1 \gg p_2 \gg \dots$ ).
- Dynamics: Predicts strictly sequential learning. Task $i+1$ only starts learning precisely when task $i$ finishes. Each task takes a fixed time $t_0 = 1/\eta_{\rm eff}$ to learn.
- Learning Time: Total time to learn $n_{\rm task}$ skills is $T = n_{\rm task} t_0 \propto n_{\rm task}$ .

Practical Implications and Applications

Neural Scaling Laws:
- The models provide insights into scaling exponents relating loss to model size ( $\alpha_N$ ) and training steps/data ( $\alpha_S$ ).
- Compared to the "Quanta model" $t_n \propto 1/p_n$ $t_{n} \propto 1/ p_{n}$ " title="" rel="nofollow" data-turbo="false" class="assistant-link">michaud2024quantization, the Domino model (assuming $t_n \propto n$ ) predicts $\alpha_S = \alpha_N = \alpha-1$ (where $p_i \propto i^{-\alpha}$ ), which aligns better with some empirical findings like Chinchilla, especially the replication study [besiroglu2024chinchilla] suggesting $\alpha_S \approx \alpha_N$ .
- Crucially, the Geometry model in the underparametrized regime can produce non-zero scaling exponents even for $\alpha=1$ (Zipfian distribution), predicting $\alpha_N \approx 0.34$ , matching Chinchilla [hoffmann2022training], whereas the Quanta and Domino models predict $\alpha_N=0$ .
- Experiments on multitask sparse parity show that optimizer choice (Adam betas) affects measured scaling exponents, reinforcing the importance of optimization dynamics.
Optimization:
- Understanding Optimizers: The models, particularly the Geometry model, serve as useful, interpretable testbeds for optimizers.
- They show adaptive, element-wise optimizers (like SignGD, Adam) are sensitive to the alignment between task directions and parameter axes, unlike SGD. Non-basis-aligned quadratic losses demonstrate the Domino effect with SignGD but not SGD.
- Applying SignGD (Adam with $\beta_1=\beta_2=0$ ) to a grokking task (modular addition) showed significantly faster generalization than standard Adam, suggesting simplified models can inspire practical optimizer choices.
- The Geometry model helps analyze newer optimizers like AdEMAMix pagliardini2024ademamix and Lion chen2024symbolic, reproducing their characteristic behaviors and linking them to skill dynamics under noise and task interference.
- Data Reweighting: The observation that low-frequency tasks learn later suggests reweighting data. Using loss as a proxy for frequency (weighting high-loss examples more) showed speedups in early stages of GPT-2 pre-training, though potential issues with noisy tokens in later stages were noted. The Resource model's conserved quantities ( $\ell^{1/(2p_i)}=C$ ) offer a theoretical basis for inferring frequencies from losses.
Task Compositionality:
- The Resource model can be readily extended to model dependent tasks by introducing a modulating function $B_i(u_1, \dots, u_{n_{\rm task}})$ that determines when a task can start learning based on the status of its prerequisites.
- Example: $B_k(u_i, u_j) = (1-u_i)^\gamma (1-u_j)^\gamma$ models task $k$ needing tasks $i$ AND $j$ to be learned.
- This was shown to capture the dynamics of learning composite sparse parity ( $y_3 = y_1 \oplus y_2$ ) better than assuming independence. The framework allows simulating dynamics for arbitrary task dependency graphs.
Modularity:
- The Domino model provides a clear rationale for modularity benefits. In a non-modular network, tasks compete for the same resources ( $n_{\rm dim}$ ), leading to sequential learning ( $T_N \propto n_{\rm task} / \sqrt{n_{\rm dim}}$ ).
- In a perfectly modular network where parameters are partitioned ( $n'_{\rm dim} = n_{\rm dim} / n_{\rm task}$ per module), tasks learn in parallel within their dedicated resources, albeit with slower individual rates. The total time becomes $T_M \propto \sqrt{n_{\rm task}} / \sqrt{n_{\rm dim}}$ .
- This theoretical speedup ( $T_M \ll T_N$ for large $n_{\rm task}$ ) is analogous to Grover's algorithm speedup ( $O(\sqrt{N})$ vs $O(N)$ ).
- The Geometry model simulations confirm modular networks learn faster in later stages, especially when $n_{\rm dim}$ isn't excessively large.
- An MLP experiment predicting $(x^2, y^2)$ with sparse $y$ showed that a non-modular MLP exhibited sequential learning ( $t_2 \approx 2t_1$ ), while a modular MLP learned both tasks roughly in parallel ( $t_2 \approx t_1$ ).

Conclusion and Limitations

The paper offers a hierarchy of physics-inspired models (Geometry, Resource, Domino) that simplify the complex dynamics of skill learning in neural networks. These models provide intuitive explanations for phenomena like the Domino effect and yield practical insights into scaling laws, optimizer behavior, task dependencies, and the benefits of modularity.

Limitations include not modeling overfitting/generalization gaps, assuming convex local landscapes within the Geometry model, and not providing a concrete method to map real-world data/tasks onto the abstract "skill" representation. Despite these, the models are presented as useful tools for building intuition, benchmarking optimizers, and inspiring algorithmic ideas like data reweighting and modular architectures. The core message advocates for using appropriately simplified models to understand complex systems like neural networks.

Markdown Report Issue