Delayed Generalization (Grokking)
- Delayed generalization (grokking) is a phenomenon where a model quickly memorizes training data yet shows stagnant test performance until a sudden phase transition boosts generalization.
- It unfolds in distinct phases including rapid training accuracy gain, a prolonged overfitting plateau, and an abrupt generalization phase marked by significant changes in network structure.
- Quantitative early-warning signals such as commutator defects and network sparsity, along with structural analyses, inform training protocols for enhanced model robustness.
Delayed generalization, widely known in the literature as "grokking," refers to the phenomenon in which a machine learning model attains near-maximal training accuracy after a relatively short period of optimization, while the corresponding test accuracy remains at or near chance level for a greatly extended period before an abrupt transition to strong generalization. The effect has been identified across diverse architectures and problem domains, including fully connected networks, transformers, linear models, generative diffusion models, and reinforcement learning agents. The hallmark of delayed generalization is a protracted regime in which test performance stagnates after overfitting, terminating in a rapid, discontinuous improvement in held-out accuracy.
1. Formal Definitions and Canonical Signatures
Let A_train(t) and A_test(t) denote the training and test accuracies, respectively, at epoch or training step t. Delayed generalization (grokking) is characterized by the following sequence:
- A_train(t) rapidly approaches its maximum (100%) at some early point t_fit.
- A_test(t) remains near chance (e.g., 25% for uniform four-class classification) or at an overfitting plateau through a prolonged interval t_fit < t < t_gen.
- At a critical time t_gen, A_test(t) abruptly rises to its asymptotic value A_test(∞), often observed as a sharp phase transition.
The delay Δt = t_gen − t_fit captures the essence of this lag. In standard deep learning, train and test accuracy typically track each other; grokking breaks this synchrony via an extended train–test gap (Hutchison et al., 29 Oct 2025, Singh et al., 6 Nov 2025, Minegishi et al., 2023). This generalizes to diffusion generative models, where validation loss or out-of-sample novelty/minimal copy rate reach their optimal value significantly ahead of measurable overfitting or memorization (Favero et al., 22 May 2025).
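Given logged accuracy curves, the delay between fitting and generalization can be extracted directly. A minimal sketch; the threshold names `fit_thresh` and `gen_thresh` and their values are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def grokking_delay(train_acc, test_acc, fit_thresh=0.99, gen_thresh=0.9):
    """Locate t_fit (first step where train accuracy exceeds fit_thresh)
    and t_gen (first step where test accuracy exceeds gen_thresh);
    return both together with the delay t_gen - t_fit."""
    train_acc = np.asarray(train_acc)
    test_acc = np.asarray(test_acc)
    fit_steps = np.flatnonzero(train_acc >= fit_thresh)
    gen_steps = np.flatnonzero(test_acc >= gen_thresh)
    if fit_steps.size == 0 or gen_steps.size == 0:
        return None  # model never fit, or never generalized, in the log
    t_fit, t_gen = int(fit_steps[0]), int(gen_steps[0])
    return {"t_fit": t_fit, "t_gen": t_gen, "delay": t_gen - t_fit}

# Synthetic curves: training saturates at step 10, test jumps at step 500.
train = np.minimum(1.0, np.arange(1000) / 10)
test = np.where(np.arange(1000) < 500, 0.25, 0.98)
print(grokking_delay(train, test))  # delay of 490 steps
```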
2. Mechanistic Theories: Phases, Transitions, and Network Restructuring
Delayed generalization typically exhibits a two- or three-phase dynamical progression:
- Rapid fit ("memorizing" phase): Training loss decays rapidly to zero, while test loss may plateau or rise slightly, corresponding to classical overfitting.
- Extended compression or exploration phase: Training loss remains flat and low; test loss stagnates. Internally, network weights or decision boundaries undergo slow, high-dimensional reorganization, either compressing the representation (Koch et al., 17 Apr 2025, Humayun et al., 2024) or exploring candidate subnetworks (Minegishi et al., 2023).
- Sharp generalization ("grokking" phase): A critical transition rapidly decreases test loss and increases test accuracy, coinciding with reduced network complexity, clarified representations, and emergent modularity or sparsity (Hutchison et al., 29 Oct 2025, Liang et al., 3 Dec 2025).
For example, principal component analyses of weights and activations show that the effective rank of hidden-layer connectivity drops dramatically at the grokking transition, producing a sparse subnetwork suited for generalization (Hutchison et al., 29 Oct 2025). In the lottery ticket framework, the "grokking ticket"—a sparse subgraph discovered late in training—enables rapid generalization; the plateau period is interpreted as a combinatorial search for this subnetwork (Minegishi et al., 2023). Circuit-based metrics (local complexity, linear mapping number) likewise decrease during delayed generalization phases, corresponding to smoothing of the function and alignment with test geometry (Koch et al., 17 Apr 2025, Humayun et al., 2024).
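The effective-rank diagnostic mentioned above can be computed from a weight matrix's singular value spectrum. A hedged sketch using the entropy-based definition of effective rank; the cited work's exact estimator is not reproduced here:

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Effective rank as exp of the entropy of the normalized singular
    value spectrum: p_i = s_i / sum_j s_j, erank = exp(-sum p_i log p_i).
    A full-spectrum matrix scores near min(m, n); a low-rank one near
    its true rank."""
    s = np.linalg.svd(np.asarray(W, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically-zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
dense = rng.normal(size=(64, 64))                           # full spectrum
low = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))   # rank-2 structure
print(effective_rank(dense), effective_rank(low))
```

Tracking this quantity per layer over training would surface the sharp drop reported at the grokking transition.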
3. Quantitative Phase Transition and Early-Warning Signals
Several studies formalize the transition to generalization as a physical or geometric phase transition:
- Order parameters: Sparsity (fraction of active weights), average node degree, and confidence gaps are used as analogs of thermodynamic variables, exhibiting discontinuous changes at the grokking transition. This provides direct analogies to liquid–solid or order–disorder transitions (Hutchison et al., 29 Oct 2025).
- Commutator defect: Loss-landscape curvature, measured by the normalized commutator of two mini-batch gradients, provides a parameter-free, architecture-agnostic early-warning signal of imminent grokking. Empirically, the commutator defect rises before test accuracy surges, with a deterministic lead time exhibiting superlinear scaling laws with respect to the total grokking time (Xu, 19 Feb 2026).
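As a minimal sketch of the sparsity order parameter, one can track the fraction of weights that are negligible relative to the largest weight. The threshold and normalization here are illustrative assumptions, not the cited paper's definition:

```python
import numpy as np

def weight_sparsity(weights, thresh=1e-3):
    """Order parameter: fraction of weights whose magnitude, relative to
    the largest weight across all given matrices, falls below thresh."""
    w = np.concatenate([np.abs(np.asarray(m, dtype=float)).ravel()
                        for m in weights])
    scale = w.max() + 1e-12
    return float(np.mean(w / scale < thresh))

sparse_layer = np.zeros((10, 10))
sparse_layer[0, :5] = 1.0            # 5 active weights out of 100
dense_layer = np.ones((10, 10))      # everything active
print(weight_sparsity([sparse_layer]), weight_sparsity([dense_layer]))
```

A discontinuous jump in this fraction at some step would be the signature the order-parameter framing predicts.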
Experiments show that boosting transverse curvature (by adding orthogonal noise to optimization steps) can accelerate grokking, while suppressing it can prevent the transition. This demonstrates a mechanistic, not merely correlative, relationship between curvature and delayed generalization (Xu, 19 Feb 2026).
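The quantity underlying the commutator defect is the degree to which gradient steps on two different mini-batches fail to commute. The following toy illustration on per-batch quadratic losses L_i(w) = ½ wᵀA_i w conveys that idea only; it is not the paper's normalized estimator:

```python
import numpy as np

def sgd_step(A, w, lr):
    # Gradient of 0.5 * w^T A w (A symmetric) is A w.
    return w - lr * (A @ w)

def commutator_defect(A1, A2, w, lr=0.1):
    """Normalized non-commutativity of two mini-batch SGD steps:
    batch 1 then batch 2, versus batch 2 then batch 1. For quadratic
    losses the difference is lr^2 * (A2 A1 - A1 A2) w, a pure commutator."""
    w12 = sgd_step(A2, sgd_step(A1, w, lr), lr)
    w21 = sgd_step(A1, sgd_step(A2, w, lr), lr)
    return float(np.linalg.norm(w12 - w21) / (np.linalg.norm(w) + 1e-12))

rng = np.random.default_rng(1)
w = rng.normal(size=5)
A = rng.normal(size=(5, 5)); A = A @ A.T   # symmetric PSD "batch" curvatures
B = rng.normal(size=(5, 5)); B = B @ B.T
print(commutator_defect(A, B, w))   # nonzero: the batches do not commute
print(commutator_defect(A, A, w))   # zero: identical batches commute
```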
4. Statistical and Structural Factors Governing Onset
Grokking is not driven solely by regularization strength or data sparsity. Rather, distribution shift between training and test sets is the central driver (Carvalho et al., 3 Feb 2025). Even with dense data and minimal hyperparameter tuning, controlled sub-class sampling that induces a structural shift is sufficient to trigger prolonged memorization and delayed test-set improvement.
Network architecture and structural properties are pivotal: delayed generalization can occur in deep or shallow networks, MLPs, transformers, or even in linearly separable logistic regression, as long as the underlying test distribution, support vector imbalance, or class asymmetry creates an extended period during which the solution cannot yet generalize (Das et al., 9 Feb 2026, Humayun et al., 2024).
A general scaling law relates dataset size to grokking delay: in overparameterized generative models, memorization time grows linearly with dataset size, and the generalization/overfitting boundary grows accordingly, allowing precise early stopping (Favero et al., 22 May 2025).
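If memorization time does grow linearly with dataset size, one practical consequence is that pilot runs on small datasets can be extrapolated to choose a stopping time for a larger run. A hedged sketch; the pilot numbers and function name are illustrative:

```python
import numpy as np

def extrapolate_stop_time(sizes, mem_times, target_size):
    """Fit t_mem ~ a * n + b to pilot measurements of memorization onset
    and extrapolate to target_size; stopping before this time keeps the
    model in the generalizing regime under the linear-scaling assumption."""
    a, b = np.polyfit(sizes, mem_times, deg=1)
    return float(a * target_size + b)

sizes = [1_000, 2_000, 4_000]
mem_times = [5_000, 10_000, 20_000]   # hypothetical pilot measurements
print(extrapolate_stop_time(sizes, mem_times, 16_000))  # ≈ 80000
```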
5. Delayed Generalization in Knowledge Transfer and Data Scarcity
Grokking is critically dependent on data coverage: below a critical sample threshold, generalization does not occur regardless of training time. However, knowledge distillation (KD) can induce or accelerate grokking even under data scarcity or domain shift. Distillation from a pre-grokked teacher provides dense, structured guidance that biases the student toward generalizing "circuits," raises the effective data efficiency, and can entirely remove the delayed generalization interval (Singh et al., 6 Nov 2025).
In continual, multi-distribution, or joint-training setups, KD preserves performance on earlier tasks and enables monotonic adaptation to new domains, bypassing grokking delays and catastrophic forgetting (Singh et al., 6 Nov 2025).
6. Theoretical and Statistical Frameworks
Delayed generalization has been linked to the information bottleneck: the compression or coarse-graining phase corresponds to reducing mutual information between activations and raw inputs, while retaining information between activations and outputs. Analytical models describe the memorization-to-generalization transition as a competition between two sub-circuits: one that memorizes by rote storage of mappings, and another that generalizes by on-the-fly inference. The critical threshold for switching between these sub-circuits grows as a power law of training diversity, governed by scaling exponents determined by network and task statistics (Nguyen et al., 2024).
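The compression picture can be probed with a crude mutual-information estimate: an activation that merely copies the input retains more information about it than a coarse-grained (compressed) code. A toy sketch with a histogram estimator; the binning and the synthetic "activations" are assumptions, not the estimators used in the cited analyses:

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Plug-in histogram estimate of mutual information I(X; Y) in nats
    for 1-D signals; crude, but enough to compare representations."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
faithful = x + 0.1 * rng.normal(size=5000)             # copies the input
compressed = np.sign(x) + 0.1 * rng.normal(size=5000)  # coarse-grained code
print(binned_mi(x, faithful), binned_mi(x, compressed))
```

The compressed code discards within-class detail while keeping the task-relevant bit, which is exactly the trade-off the information-bottleneck account attributes to the grokking transition.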
In linear models, grokking emerges purely from the late-time evolution of the bias term, with three-phase dynamics (population-dominated, unlearning, and re-generalization). The grokking delay depends exponentially on class imbalance in support vectors (Das et al., 9 Feb 2026).
7. Applications, Diagnostics, and Open Directions
Delayed generalization is not merely a theoretical curiosity: it has profound implications for training diagnostics and optimization:
- Early stopping: In generative modeling, the optimal stopping time for maximal generalization precedes memorization and scales linearly with dataset size (Favero et al., 22 May 2025).
- Model unlearning: Grokked models, having developed modular and orthogonal representations, prove to be better unlearners—allowing efficient, selective forgetting of targeted data with reduced collateral damage and improved stability (Liang et al., 3 Dec 2025).
- Training protocols: Delayed or infrequent policy updates in actor-critic DRL (delayed policy updates) improve test-time robustness and generalization to novel environments, substituting temporal smoothing of the actor for direct regularization (Grando et al., 2024).
- Universal observability: Geometry-based measures (local complexity, commutator defect, mutual information) provide label-free indicators of impending generalization, which can supplement or replace traditional learning-curve diagnostics (Humayun et al., 2024, Koch et al., 17 Apr 2025, Xu, 19 Feb 2026).
Delayed generalization remains sensitive to model architecture (e.g., batch normalization abolishes the phenomenon by redefining the local geometry), regularization schedules, and the structure of the learning problem. Its implications span the design of robust, adaptive models, the analysis of training dynamics, and potential avenues for controlling generalization via structural, statistical, or procedural interventions.