
Transformers converge to invariant algorithmic cores

Published 26 Feb 2026 in cs.LG and cs.AI | (2602.22600v1)

Abstract: LLMs exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 LLMs govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.

Summary

  • The paper demonstrates that transformers converge to low-dimensional invariant algorithmic cores that are both necessary and sufficient for effective performance.
  • It introduces Algorithmic Core Extraction (ACE), a method leveraging control theory to isolate causal subspaces using joint singular value decomposition and ablation studies.
  • Empirical results across tasks such as Markov chain prediction and grammatical agreement validate that these cores enable robust model behavior and targeted model control.

Summary: Transformers Converge to Invariant Algorithmic Cores

Problem Statement and Motivations

The paper "Transformers converge to invariant algorithmic cores" (2602.22600) rigorously investigates the internal computational mechanisms of transformer models. The central issue addressed is the underdetermination of internal structure by functional outcome: numerous parameter configurations can implement identical input-output behaviors, complicating mechanistic interpretability. The study posits that mechanistic explanations should focus on low-dimensional, functionally necessary and sufficient subspaces (algorithmic cores) shared across distinct training runs and architectures, rather than idiosyncratic implementation details.

Methodological Contributions

The authors introduce Algorithmic Core Extraction (ACE), a principled procedure grounded in control theory concepts, specifically reachability and observability, to extract causal subspaces from layer activations. ACE isolates directions that are both highly active and relevant for task output by computing a joint singular value decomposition of activation variances and output sensitivities.
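To make the idea concrete, here is a minimal numerical sketch of this style of joint extraction — not the paper's implementation. It combines an activation covariance (reachability-style Gramian) with an output-sensitivity covariance (observability-style Gramian) in the spirit of balanced truncation; the function name `extract_core` and all shapes are hypothetical:

```python
import numpy as np

def extract_core(acts, sens, k):
    """Extract a k-dim candidate core from layer activations.

    acts: (n, d) activations -> reachability-style Gramian
    sens: (n, d) output sensitivities (e.g. loss gradients
          w.r.t. activations) -> observability-style Gramian
    Returns an orthonormal (d, k) basis.
    """
    acts = acts - acts.mean(axis=0)
    P = acts.T @ acts / len(acts)   # activation covariance
    Q = sens.T @ sens / len(sens)   # sensitivity covariance
    # Directions that are both highly active (large P) and
    # output-relevant (large Q): top eigenvectors of P @ Q,
    # as in balanced truncation of linear systems.
    w, V = np.linalg.eig(P @ Q)
    top = np.argsort(-w.real)[:k]
    return np.linalg.qr(V[:, top].real)[0]

# Synthetic check: only a 3D subspace is both active and relevant.
rng = np.random.default_rng(0)
d, k = 16, 3
B = np.linalg.qr(rng.standard_normal((d, k)))[0]   # ground-truth core
z = rng.standard_normal((2000, k))
acts = z @ B.T + 0.05 * rng.standard_normal((2000, d))
sens = z @ B.T + 0.05 * rng.standard_normal((2000, d))
U = extract_core(acts, sens, k)
# Cosines of principal angles between recovered and true subspace:
overlap = np.linalg.svd(U.T @ B, compute_uv=False)
print(np.round(overlap, 3))   # all close to 1
```

On this synthetic data the recovered basis spans essentially the same subspace as the planted one, which is the property the ablation tests below are designed to certify in real models.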

Cores are validated by causal ablations: retaining only the core should preserve performance (sufficiency), while removing it should reduce performance to chance (necessity). The internal dynamics of the core are mechanistically characterized via operator fitting, with spectral analysis revealing the algorithmic structure.
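The ablation logic can be sketched on synthetic data, with a toy linear readout standing in for the real model head (all names here are illustrative assumptions): projecting activations onto the core should preserve accuracy, while projecting the core out should drop accuracy to chance.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 16, 3, 4000
# Toy setup: labels depend only on a k-dim "core" subspace.
core = np.linalg.qr(rng.standard_normal((d, k)))[0]
acts = rng.standard_normal((n, d))
labels = (acts @ core @ rng.standard_normal(k) > 0).astype(int)

def readout_acc(x):
    # Least-squares linear readout, a stand-in for the model head.
    w, *_ = np.linalg.lstsq(x, 2.0 * labels - 1.0, rcond=None)
    return float(((x @ w > 0) == labels).mean())

P = core @ core.T                          # projector onto the core
full = readout_acc(acts)
core_only = readout_acc(acts @ P)          # sufficiency: ~ full accuracy
core_removed = readout_acc(acts @ (np.eye(d) - P))  # necessity: ~ chance
print(full, core_only, core_removed)
```

The core-only condition matches full accuracy and the core-removed condition falls to roughly 50%, mirroring the sufficiency/necessity criteria the paper applies causally inside trained transformers.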

Empirical Results

Markov Chain Learning

Single-layer transformers trained on a Markov-chain next-token prediction task converge to distinct weight configurations but share a compact 3D core that is necessary and sufficient for performance. Although the geometric embeddings of these cores are nearly orthogonal across models, canonical correlation analysis shows near-unity alignment, confirming functional equivalence. Linear operators fitted within the core subspace recover the ground-truth transition spectrum of the Markov chain to within 1%.
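The operator-fitting step can be illustrated on synthetic belief-state dynamics (a stand-in for real core activations, not the paper's data): a linear operator fitted between consecutive core states is similar to the transition matrix, so their eigenvalue spectra coincide regardless of how the core happens to be embedded.

```python
import numpy as np

rng = np.random.default_rng(2)
# A 3-state Markov chain; its transition spectrum is the target.
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])   # eigenvalues: 1.0, 0.5, 0.4

# Stand-in for core activations: belief states b_{t+1} = b_t @ T,
# embedded in a 3D core by an arbitrary invertible map M.
M = rng.standard_normal((3, 3))
B = rng.dirichlet(np.ones(3), size=200)   # random belief vectors
X = B @ M          # core states at time t
Y = (B @ T) @ M    # core states at time t + 1

# Fit the core dynamics X @ A ~ Y. Here A = M^{-1} T M is similar
# to T, so the fitted operator's spectrum matches the chain's.
A, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.sort(np.linalg.eigvals(A).real))   # ~ [0.4, 0.5, 1.0]
```

The fitted spectrum is invariant to the embedding `M`, which is exactly why two models with nearly orthogonal core embeddings can still recover identical transition spectra.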

Modular Addition and Grokking Dynamics

Two-layer transformers trained on modular addition exhibit the grokking phenomenon: test accuracy jumps abruptly long after training accuracy has saturated through memorization. ACE reveals that the algorithmic core compresses and crystallizes at grokking, adopting rotational (cyclic) dynamics consistent with Fourier-theoretic interpretations. When training continues with weight decay, the core inflates through redundancy: more rotational modes are recruited than are minimally necessary. The paper formalizes this process with an ODE model in which the grokking delay scales inversely with weight decay and redundancy, a prediction confirmed experimentally through scaling-law sweeps.
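The paper's own compactness measure is not reproduced here, but the qualitative compression-then-inflation effect can be illustrated with a standard effective-dimension proxy, the participation ratio (an assumption of this sketch, not necessarily the paper's metric):

```python
import numpy as np

def participation_ratio(acts):
    """Effective dimensionality: (sum lam_i)^2 / sum lam_i^2 over
    covariance eigenvalues; roughly k when variance spreads evenly
    over k directions."""
    acts = acts - acts.mean(axis=0)
    lam = np.linalg.eigvalsh(acts.T @ acts / len(acts))
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(3)
d = 32
# Compact just-grokked core: signal confined to 4 directions.
compact = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, d))
# Inflated late-training core: redundantly spread over 12 directions.
inflated = rng.standard_normal((1000, 12)) @ rng.standard_normal((12, d))
print(participation_ratio(compact), participation_ratio(inflated))
```

Tracking such a statistic over checkpoints would show a dip at grokking followed by a rise under continued weight decay, the signature the paper interprets as crystallization followed by redundant inflation.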

GPT-2 LLMs and Grammatical Agreement

Across GPT-2 Small, Medium, and Large, subject-verb number agreement is governed by a single causal axis localized within the last layers. This axis is necessary, sufficient, and directionally controllable: flipping it inverts grammatical preferences, not merely for targeted next-token predictions but throughout open-ended autoregressive generation. The extracted cores exhibit robust spectral gaps, confirming their one-dimensionality, and core projections align strongly across scales, indicating invariant representation. Strong numerical results are reported: agreement (AUC) is maintained with core-only interventions (>0.97), drops to chance with core-removed (<0.25), and inverts with core-flipped (<0.04).
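The flip intervention can be sketched on a toy residual-stream state: reflecting the state across the hyperplane orthogonal to a hypothetical number axis `u` negates only the component along that axis and inverts the verb-number preference. The readout weights here are synthetic stand-ins, not GPT-2's actual unembedding.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
# Hypothetical "number axis" u in a toy residual stream; states with
# positive projection prefer singular verbs, negative prefer plural.
u = np.linalg.qr(rng.standard_normal((d, 1)))[0][:, 0]
w_sing = 2.0 * u + 0.05 * rng.standard_normal(d)   # synthetic verb logits
w_plur = -2.0 * u + 0.05 * rng.standard_normal(d)

def verb_pref(h):
    return "sing" if h @ w_sing > h @ w_plur else "plur"

def flip_axis(h, u):
    # Householder-style reflection: negate only the component along u,
    # leaving the orthogonal complement of the state untouched.
    return h - 2.0 * (h @ u) * u

h = 1.5 * u + 0.2 * rng.standard_normal(d)   # a "singular subject" state
print(verb_pref(h), verb_pref(flip_axis(h, u)))   # sing plur
```

In the paper's actual experiments, the analogous edit is applied to hidden states during generation, which is what propagates the inverted number preference through entire continuations.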

Theoretical Implications

Structure–Function Degeneracy and System Drift

The findings generalize the principle of functional equivalence: diverse internal parameterizations implement identical algorithmic cores. This mirrors phenomena in biology (degeneracy, system drift), control theory (minimal realizations), and physics (gauge symmetry). Algorithmic core alignment across divergent models reveals substantial drift orthogonality, with direct implications for neural network merging: the geometric misalignment precludes naive recombination of weights.

Interpretability and World Models

Mechanistic interpretability should prioritize extraction and characterization of invariant cores. The recovered cores often encode structural abstractions of the data-generating process—e.g., cyclic operators for group tasks, spectral signatures for Markov dynamics—aligning with classical system identification and internal model principles.

Redundancy, Regularization, and Generalization

Extended training with weight decay encourages redundant, distributed representations. While this regularization aids generalization, it also inflates the core, making interpretability more challenging post-grokking. Theoretical modeling shows that redundancy accelerates the transition to generalization, and that there is an "interpretability window" after grokking but before redistribution, during which the algorithmic organization is maximally compact.

Practical Directions and Limitations

Algorithmic core extraction is technically grounded, empirically validated, and generalizes across scales and architectures. However, extraction for complex tasks or multifunctional systems requires precise inquiry framing and may not always yield low-dimensional cores. The methodology invites integration with sparse feature methods, cross-layer analysis, and operator-based characterizations.

Outlook for AI Research

The paper empirically substantiates that transformer computations are organized around compact, shared algorithmic cores, which persist across seeds, checkpoints, and architectural scales. Mechanistic control can be achieved by targeting core directions, offering actionable levers for model steering. This paradigm may inform principled diagnostics for model merging, robust feature extraction, and intervention, and suggests that interpretability and controllability should focus on invariants rather than particularities. Future research can explore scaling the framework to frontier multi-task LLMs and deeper hierarchical reasoning, as well as discovering new invariants empirically.

Conclusion

Transformers realize task computations by converging to low-dimensional algorithmic cores that are invariant across independent training runs and robust to architectural scale. These cores are both necessary and sufficient, encapsulate the essential dynamics of the data-generating process, and often admit compact operator characterizations. Interpretability efforts should concentrate on extracting and understanding such invariants—the computational essence—rather than idiosyncratic details of particular realizations. The algorithmic core framework operationalizes this intuition, providing a rigorous and practical foundation for interpretability, control, and future developments in AI model analysis.
