Catastrophic Forgetting in LLMs
- Catastrophic forgetting is a phenomenon where LLMs lose past-task performance due to destructive gradient interference and representational drift during sequential fine-tuning.
- Quantitative analyses reveal that lower-layer attention heads face up to 67% disruption and a 2.4-fold increase in forgetting when gradient alignment is negative.
- Mitigation techniques such as targeted regularization and representational realignment can restore up to 47% of lost performance while maintaining new-task competence.
Catastrophic forgetting in LLMs denotes a persistent reduction in previously acquired performance when models undergo continual fine-tuning on sequential tasks. It arises from destructive interference among gradient directions, drift in internal representations, and changes in the loss landscape, resulting in irreversible loss of prior capabilities despite the acquisition of new competencies. Mechanistic analysis of transformer-based architectures has identified gradient interference in attention weights, representational drift in intermediate layers, and loss-landscape flattening as primary drivers. Severity correlates strongly with task similarity, and certain attention heads, especially in lower layers, are disproportionately disrupted. This foundational understanding has yielded targeted mitigation strategies, opening the door to more robust continual learning systems for LLMs (Imanov, 26 Jan 2026).
1. Formal Characterization and Mechanistic Drivers
Catastrophic forgetting in transformer-based LLMs is mechanistically traceable to three primary phenomena:
- Gradient Interference in Attention Weights: Let $g_A$, $g_B$ be the gradient vectors for tasks $A$ and $B$. Their alignment is quantified by the cosine similarity

$$\mathrm{align}(g_A, g_B) = \frac{g_A \cdot g_B}{\lVert g_A \rVert \, \lVert g_B \rVert}.$$

Negative alignment signals destructive interference, with an empirically observed 2.4-fold increase in forgetting rate when $\mathrm{align}(g_A, g_B) < 0$. Query/key attention matrices exhibit the highest conflict rates, acutely contributing to early forgetting.
- Representational Drift: Hidden-state activations in intermediate layers undergo notable geometric change during sequential fine-tuning. For layers $8$–$16$ in a $24$-layer decoder, CKA-normalized Euclidean drift ranges $0.32$–$0.47$, with substantial rotation of the leading principal components. Lower layers (drift up to $0.23$) and upper layers ($0.08$–$0.14$) are less affected.
- Loss Landscape Flattening: Hessian analysis indicates a sharp initial basin at pre-tuning that flattens after subsequent tasks; this diminishes the capacity to recover prior minima.
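The alignment measure underlying these analyses is cosine similarity between task gradients; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def gradient_alignment(g_a: np.ndarray, g_b: np.ndarray) -> float:
    """Cosine similarity between two task gradient vectors.

    Negative values indicate destructive interference: a step along
    g_b moves parameters against the descent direction for task A.
    """
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one zero gradient
    return float(np.dot(g_a, g_b) / denom)
```

In practice the gradients would be flattened parameter-block gradients (e.g. of the query/key matrices), so the same scalar can be tracked per attention component.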
Task similarity, quantified by semantic and gradient measures, shows a strong Pearson correlation with forgetting severity. Counterintuitively, higher similarity exacerbates forgetting due to parameter overlap in shared subsystems (Imanov, 26 Jan 2026).
2. Quantitative Manifestations and Attention Head Disruption
During continual fine-tuning:
- At least 15% of attention heads exhibit a Euclidean weight change exceeding $2.5$ standard deviations, classifying them as "severely disrupted."
- Disruption clusters predominantly in lower layers (layers $1$–$8$ in 24-layer models, $1$–$12$ in 32-layer models), highlighting their heightened vulnerability due to less specialization in attention patterns.
- Ablation of the most-disrupted heads restores a substantial fraction of lost prior-task performance at only a modest cost to new-task performance, confirming their causal involvement.
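The "severely disrupted" criterion can be sketched as a z-score test over per-head weight-change norms (a NumPy sketch; `deltas` and the threshold handling are illustrative):

```python
import numpy as np

def severely_disrupted_heads(deltas: np.ndarray,
                             z_thresh: float = 2.5) -> np.ndarray:
    """Flag attention heads whose weight change is an outlier.

    deltas: Euclidean norms of each head's weight change after
    fine-tuning, shape (n_heads,).  A head counts as "severely
    disrupted" when its change lies more than z_thresh standard
    deviations above the mean change across heads.
    """
    mu, sigma = deltas.mean(), deltas.std()
    if sigma == 0.0:
        return np.zeros_like(deltas, dtype=bool)  # no spread, no outliers
    return (deltas - mu) / sigma > z_thresh
```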
Behaviorally, forgetting severity varies with inter-task similarity, with distinct degradation rates for medium- and low-similarity task sequences (in the $400$B-parameter regime). Larger models forget more slowly but are subject to the same mechanistic triad (Imanov, 26 Jan 2026).
3. Theoretical Advances and Gradient Alignment Perspective
Recent work has established a direct theoretical link between negative gradient similarity and catastrophic forgetting. For a mastered set $D_m$ and an injection set $D_i$, forgetting occurs iff their gradients conflict:

$$\cos\!\big(\nabla_\theta \mathcal{L}_{D_m},\, \nabla_\theta \mathcal{L}_{D_i}\big) < 0.$$

Neuron-wise decomposition reveals that at least 50% of neurons are "conflicting" (negative gradient similarity) while roughly 25% are "collaborative" (Yang et al., 29 Jan 2026). Collaborative Neural Learning (CNL) selectively freezes conflicting neurons, eliminating catastrophic forgetting under infinitesimal learning rates and exactly known mastered sets, and yielding substantial empirical reductions in forgetting under practical conditions.
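A minimal NumPy sketch of CNL-style selective freezing follows; the per-neuron cosine rule and the update form are simplified assumptions, not the authors' exact algorithm:

```python
import numpy as np

def conflict_mask(g_mastered: np.ndarray, g_inject: np.ndarray,
                  eps: float = 1e-12) -> np.ndarray:
    """Per-neuron gradient-similarity mask in the spirit of CNL.

    g_mastered, g_inject: gradients w.r.t. each neuron's incoming
    weight vector, shape (n_neurons, fan_in).  Returns True for
    "conflicting" neurons (negative per-neuron cosine similarity).
    """
    dots = (g_mastered * g_inject).sum(axis=1)
    norms = (np.linalg.norm(g_mastered, axis=1)
             * np.linalg.norm(g_inject, axis=1))
    return dots / np.maximum(norms, eps) < 0.0

def cnl_update(weights, g_mastered, g_inject, lr=1e-3):
    """Apply the new-task gradient step only to non-conflicting neurons."""
    frozen = conflict_mask(g_mastered, g_inject)
    step = np.where(frozen[:, None], 0.0, -lr * g_inject)
    return weights + step
```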
4. Mitigation Strategies: Algorithmic and Mechanism-Driven Approaches
Mechanism-informed interventions have emerged:
- Targeted Regularization: Freezing or regularizing the weights of highly disrupted attention heads—especially in lower layers—constrains parameter drift.
- Representational Realignment: Affine alignment losses on intermediate-layer activations restore previously encoded features; together with targeted regularization, this recovers up to 47% of lost performance post hoc.
- Curvature Preservation: Hessian-based regularization maintains the sharpness of loss basins, reducing forgetting at the cost of a minor convergence slowdown.
- Dynamic Gradient Alignment Monitoring: Reduction of learning rate or increased regularization when negative alignment is detected prevents destructive updates.
- Composite Algorithms: Algorithms adjudicate per-parameter plasticity using combined scores of gradient conflict, representational importance, and curvature impact.
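The dynamic-monitoring idea can be sketched as follows; the EMA reference gradient, the decay factor, and the class itself are illustrative assumptions rather than a published procedure:

```python
import numpy as np

class AlignmentMonitor:
    """Track alignment with an EMA of prior-task gradients and
    scale the learning rate down when a conflict is detected."""

    def __init__(self, base_lr: float, decay: float = 0.1,
                 ema: float = 0.9):
        self.base_lr = base_lr
        self.decay = decay      # LR multiplier under conflict
        self.ema = ema          # smoothing for the reference gradient
        self.ref = None         # EMA of past task gradients

    def step_lr(self, grad: np.ndarray) -> float:
        """Return the learning rate to use for this update."""
        if self.ref is None:
            self.ref = grad.copy()
            return self.base_lr
        denom = np.linalg.norm(self.ref) * np.linalg.norm(grad)
        align = float(self.ref @ grad / denom) if denom else 0.0
        self.ref = self.ema * self.ref + (1 - self.ema) * grad
        return self.base_lr * self.decay if align < 0 else self.base_lr
```

A heavier intervention (e.g. raising a regularization weight) could be triggered from the same `align` signal.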
Ablation experiments validate the efficacy and specificity of such approaches: for example, selective regularization of attention heads and loss-surface flattening interventions can dramatically enhance retention during continual fine-tuning (Imanov, 26 Jan 2026).
5. Experimental Landscape and Benchmarks
Studies employ a diverse array of state-of-the-art decoder models: Llama 4 Scout ($109$B), Llama 4 Maverick ($400$B), DeepSeek-V3.1 ($671$B), GPT-5.1 ($1.5$T), Claude 4.5, and Gemini 2.5 Pro. Sequential fine-tuning covers $4$–$6$ tasks per trajectory, with inter-task similarity stratified. Identical hyperparameters (AdamW, matched learning rates, gradient clipping at norm $1.0$) ensure comparability. The severity and temporal profile of forgetting align with mechanistic predictions: gradient interference and head disruption dominate early epochs, representational drift the mid epochs, and basin flattening the late epochs (Imanov, 26 Jan 2026).
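The gradient-clipping component of this setup can be illustrated with a global-norm clip in the style of `torch.nn.utils.clip_grad_norm_` (a NumPy sketch, not the studies' actual training code):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Global-norm gradient clipping.

    grads: list of gradient arrays.  If the combined L2 norm exceeds
    max_norm, every array is rescaled so the global norm equals
    max_norm; otherwise the gradients pass through unchanged.
    Returns the (possibly rescaled) gradients and the pre-clip norm.
    """
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total
```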
6. Practical Implications and Outlook
By integrating mechanistic knowledge into continual learning protocols, future LLMs can sustain prior-task competence while acquiring new skills. Practical deployment should:
- Monitor early gradient alignment as a warning signal.
- Apply targeted interventions to conflict-prone components (notably attention heads).
- Constrain intermediate layer drift without hampering task adaptation.
- Combine curvature, drift, and attention metrics for optimal plasticity–stability trade-off.
- Design composite mitigation strategies with adaptive component selection.
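One way to sketch such a composite per-parameter score (the equal weighting and the assumption of [0, 1]-normalized inputs are illustrative choices, not a published rule):

```python
import numpy as np

def plasticity_scores(conflict, drift, curvature,
                      weights=(1/3, 1/3, 1/3)):
    """Combine per-parameter risk metrics into a plasticity score.

    conflict, drift, curvature: arrays in [0, 1], where higher means
    riskier to update (gradient conflict, contribution to
    representational drift, curvature impact).  Returns a score in
    [0, 1]: 1 means fully plastic (safe to update), 0 means the
    parameter should stay effectively frozen.
    """
    w = np.asarray(weights)
    risk = w[0] * conflict + w[1] * drift + w[2] * curvature
    return 1.0 - np.clip(risk, 0.0, 1.0)
```

The scores could then scale per-parameter learning rates or regularization strengths, adapting the plasticity–stability trade-off component by component.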
These advances transform catastrophic forgetting from an empirical obstacle to a tractable, mechanism-localized challenge. The findings offer robust pathways not only for continual fine-tuning but also for safe and efficient knowledge updating in LLM platforms (Imanov, 26 Jan 2026, Yang et al., 29 Jan 2026).