
Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels

Published 31 Mar 2026 in stat.ML and cs.LG | (2604.00316v1)

Abstract: Grokking occurs when a model achieves high training accuracy but generalization to unseen test points happens long after that. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.

Summary

  • The paper shows that breaking symmetry in training data is critical for enabling generalization in feature learning kernels using the Recursive Feature Machine.
  • The study employs group-theoretic analysis to reveal that disrupting subgroup invariance through random data removal restores proper generalization.
  • The results emphasize that careful data partitioning aligning with kernel symmetry and task structure is essential to avoid symmetry traps that hinder learning.

Data Symmetry and Generalization in Feature Learning Kernels

Overview

This paper explores the relationship between data symmetry and generalization within the context of feature learning kernels, specifically using the Recursive Feature Machine (RFM) algorithm. The authors investigate the phenomenon of grokking—where models achieve high training accuracy but only generalize after extensive additional training—on algebraic tasks such as modular arithmetic and binary operations in Abelian groups. Through a group-theoretic lens, the paper demonstrates that the generalization behavior of RFM is fundamentally governed by the symmetry structure of the training data and the target function, and that breaking this symmetry is essential for generalization.

Grokking and Symmetry Groups in Algebraic Tasks

The paper builds on prior studies of grokking in neural networks and feature learning kernels, extending the analysis to algebraic functions characterized by an explicit symmetry group. Binary operations in Abelian groups (e.g., modular addition) are examined, where the symmetry group of the function can transform any data point with a given label into any other with the same label. RFM, through iterative AGOP-based feature updates, is shown to learn representations closely linked to the symmetry group governing the target function. This group-theoretic approach provides a rigorous framework for understanding how selected data partitions may induce “symmetry traps” that inhibit generalization.
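As a concrete illustration (not taken from the paper's code), the dihedral symmetry of modular addition can be checked numerically: under the standard action, the "rotations" (x, y) ↦ (x + g, y − g) and the "reflection" (x, y) ↦ (y, x) both preserve the label (x + y) mod p, generating a group of order 2p. The function and variable names below are illustrative.

```python
# Sketch: verify that the dihedral symmetries of modular addition
# preserve labels, assuming the action (x, y) -> (x + g, y - g) for
# rotations and (x, y) -> (y, x) for the reflection. Illustrative only;
# the paper's exact conventions may differ.
p = 53

def label(x, y):
    return (x + y) % p

# Rotations: shift one argument forward and the other back.
rotations_ok = all(
    label((x + g) % p, (y - g) % p) == label(x, y)
    for g in range(p) for x in range(p) for y in range(p)
)

# Reflection: swapping the arguments (commutativity) also preserves labels.
swap_ok = all(label(y, x) == label(x, y) for x in range(p) for y in range(p))

print(rotations_ok, swap_ok)  # True True: together these generate a group of order 2p
```

Since every label-preserving transformation maps a data point to another point with the same label, invariance of a train-test partition under any subgroup of these maps constrains which test points the training set can "reach."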

Experimental Characterization: Symmetry Breaking and Generalization

Empirical results establish that generalization in RFM occurs only when the symmetry of the training data is broken. When the train-test partition is invariant under a subgroup of the full symmetry group (e.g., the dihedral group D_{2p} for modular addition with p prime), RFM fails to generalize to the test set; moving even a single random training point into the test set (thus breaking invariance) enables generalization (Figure 1).

Figure 1: Gaussian (top) and quadratic (bottom) kernels trained for addition modulo p = 53 on symmetric partitions; random removal of points breaks symmetry and allows generalization, whereas removal by symmetric pairs does not.

Symmetry breaking is further studied through:

  • Random vs. symmetric removal of data points: Removing points randomly from the training set disrupts subgroup invariance and rapidly restores generalization. Removing points in symmetric pairs does not break invariance and generalization remains absent.
  • Fixed points and orbits: By withholding the fixed points of specific reflections (e.g., sr^k) from the training set, test accuracy falls to zero. Breaking this symmetry by moving random points yields near-perfect generalization.
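To make the fixed-point sets mentioned above concrete, the following sketch computes the fixed points of a reflection sr^k under one standard dihedral action on data pairs: r^k sends (x, y) to (x + k, y − k) and s swaps the coordinates. The action and names are illustrative assumptions, not the paper's code.

```python
# Sketch: fixed points of the reflection s r^k acting on pairs (x, y),
# where r^k maps (x, y) -> (x + k, y - k) and s swaps coordinates.
# Illustrative convention; the paper's exact action may differ.
p = 53

def s_rk(k, x, y):
    # apply r^k, then the swap s
    return ((y - k) % p, (x + k) % p)

def fixed_points(k):
    return [(x, y) for x in range(p) for y in range(p) if s_rk(k, x, y) == (x, y)]

# Exactly the p points on the line y = x + k (mod p) are fixed.
fp = fixed_points(7)
print(len(fp))  # 53
print(all(y == (x + 7) % p for x, y in fp))  # True
```

Withholding such a line of fixed points from the training set leaves the partition invariant under the corresponding reflection, which is the kind of "symmetry trap" the experiments probe.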

Feature Matrix Structure and Group Representations

The learned features (via AGOP) are shown to encode group-theoretic structure. When trained on random partitions, RFM recovers block-circulant feature matrices, mapping closely to permutation representations of the symmetry group. With non-generalizing partitions (e.g., those withholding fixed points), the feature matrices exhibit support primarily aligned with the subgroup elements of the symmetry group (Figure 2).

Figure 2: Feature matrices learned by RFM on modular arithmetic tasks align with group-theoretic permutation representations of the relevant symmetries.

Extensive comparisons are made between the AGOP matrices produced by RFM and theoretical permutation representations, confirming that the kernel can internalize only those symmetries present in the training data.
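The circulant structure referenced here can be illustrated directly: the permutation representation of a rotation on Z_p is a cyclic shift matrix, and any nonnegative combination of such matrices is circulant. This is a generic algebraic sketch (the group elements averaged here are arbitrary), not a reconstruction of the paper's AGOP matrices.

```python
# Sketch: the permutation representation of a rotation g on Z_p is a
# cyclic shift matrix; averaging such representations over any subset of
# rotations yields a circulant matrix, the kind of structure the AGOP
# feature matrices are reported to align with. Illustrative only.
import numpy as np

p = 8

def perm_matrix(g):
    # P[i, j] = 1 iff i = (j + g) mod p, i.e. P acts by cyclic shift.
    P = np.zeros((p, p))
    for j in range(p):
        P[(j + g) % p, j] = 1.0
    return P

# Average representations of an arbitrary subset of rotations:
M = sum(perm_matrix(g) for g in [1, 3, 5]) / 3

# Circulant check: each row is the previous row rotated by one position.
is_circulant = all(np.array_equal(np.roll(M[i], 1), M[i + 1]) for i in range(p - 1))
print(is_circulant)  # True
```

Because M[i, j] depends only on (i − j) mod p, a matrix of this form treats every coset of the rotation subgroup identically, which is why circulant feature matrices correspond to learned rotation invariance.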

Subgroup Structure and Generalization Orbits

The failure to generalize is further explored through other dihedral subgroups. Data partitions invariant under subgroups of D_{2p} (e.g., larger dihedral subgroups when p is not prime) also inhibit generalization. The precise characterization of fixed points, reflection pairs, and orbits is analytically developed for both cyclic and more general Abelian group settings (Figure 3).

Figure 3: AGOPs for addition mod 32 trained on partitions excluding fixed points of dihedral subgroups. RFM does not generalize under these settings, confirming theoretical predictions.
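The role of non-prime moduli can be sketched by enumerating the possible dihedral subgroup orders: for each divisor m of n, the rotations generated by r^{n/m} together with a reflection form a dihedral subgroup of D_{2n} of order 2m, so composite n (such as 32 above) admits many more intermediate subgroups than prime n. This enumeration is a standard group-theory fact, included only as an assumption-free illustration of why prime moduli behave differently.

```python
# Sketch: orders of dihedral subgroups of D_{2n} that contain a
# reflection. Each divisor m of n gives a subgroup <r^{n/m}, s'> of
# order 2m (for some reflection s'). For prime n only m = 1 and m = n
# occur, leaving no intermediate dihedral subgroups to trap training data.
n = 32
divisors = [m for m in range(1, n + 1) if n % m == 0]
subgroup_orders = [2 * m for m in divisors]

print(divisors)         # [1, 2, 4, 8, 16, 32]
print(subgroup_orders)  # [2, 4, 8, 16, 32, 64]
```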

Initialization and Generalization to Group Orbits

By initializing the feature matrix to encode a particular subgroup symmetry (e.g., a matrix aligned with a fixed reflection), RFM is constrained to generalize only within the orbit of that subgroup action. Empirical results show perfect correspondence between theoretical orbit predictions and actual test outputs (Figure 4).

Figure 4: RFM initialized with feature matrix encoding a subgroup generalizes only to points within the orbit of that subgroup action; precision and recall are both 100% for theoretical prediction vs. observed outcomes.
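The orbit prediction above has a simple structure when the encoded subgroup is generated by a single reflection: reflections are involutions, so each orbit contains at most two points. The sketch below uses the same illustrative dihedral action on pairs as earlier; the action and names are assumptions, not the paper's code.

```python
# Sketch: orbits of data pairs under the subgroup generated by one
# reflection s r^k (an involution, so orbits have size 1 or 2), using
# the illustrative action r^k: (x, y) -> (x + k, y - k), s: swap.
p = 53
k = 7

def s_rk(x, y):
    return ((y - k) % p, (x + k) % p)

def orbit(x, y):
    # <s r^k> = {identity, s r^k}, since applying s r^k twice returns (x, y).
    return {(x, y), s_rk(x, y)}

print(len(orbit(0, 0)))   # 2: a generic point pairs up with its reflection
print(len(orbit(3, 10)))  # 1: a fixed point, since 10 = 3 + 7 (mod 53)
```

Under such an initialization, the theoretical prediction is that a test point is learnable exactly when its size-two orbit partner lies in the training set, matching the reported 100% precision and recall.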

Theoretical Implications and Directions

The paper’s analysis unifies group-theoretic representation theory and empirical results for feature learning kernels. It rigorously demonstrates that data symmetry fundamentally determines generalization: subgroup invariance in training partitions induces symmetry traps, and breaking that invariance releases the model from them. This insight applies not only to kernels but also invites comparison with neural-network grokking and with regimes in which feature learning transitions from “lazy” to “rich” dynamics.

Practically, the results advocate for careful design of data partitions in learning tasks with high symmetry; data with symmetry traps can dramatically inhibit generalization despite the use of sophisticated feature learning mechanisms. The group-theoretic approach provides predictive power for such behaviors, and suggests that symmetries in initialization and kernel choice must be matched with the structure of the learning task to avoid failures in generalization.

Conclusion

This paper provides a detailed analysis of generalization failures and successes in feature learning kernels, rooted in the symmetry structure of algebraic tasks. Using the Recursive Feature Machine and AGOP, the authors demonstrate that breaking data symmetry is necessary for generalization; partitions invariant under symmetry group subgroups prevent the kernel from learning beyond the training set. The learned features are shown to encode group-theoretic representations, and the theoretical predictions regarding orbits, fixed points, and subgroup structures match empirical outcomes precisely. This work establishes a rigorous connection between group symmetry, data partitioning, and generalization in feature learning, and sets the stage for further theoretical development and extension to other models and learning regimes.
