
Catastrophic Forgetting in LLMs

Updated 8 February 2026
  • Catastrophic forgetting in language models is the degradation of previously acquired capabilities during fine-tuning, resulting in loss of performance on earlier tasks.
  • Mechanistic analysis reveals that negative gradient cosine similarity, representational drift with 35–52° rotation in key features, and loss landscape flattening are core contributors.
  • Mitigation strategies such as selective attention freezing and curvature-aware regularization can retain up to 71% of original task performance while incorporating new data.

Catastrophic forgetting in LLMs refers to the phenomenon whereby the performance on previously learned tasks or general capabilities degrades rapidly when the model is fine-tuned on new, often unrelated, data streams. This loss of prior knowledge impedes lifelong learning, limits transfer, and poses severe bottlenecks for deploying LLMs in evolving or multi-domain real-world settings. Recent research has shifted focus from empirical observation to mechanistic analysis, quantifying forgetting dynamics at multiple scales (parameters, heads, layers), and introducing targeted mitigation frameworks with precise theoretical grounding (Imanov, 26 Jan 2026, Luo et al., 2023, Yang et al., 29 Jan 2026).

1. Mechanistic Foundations of Forgetting in Transformer LLMs

Recent large-scale probes identify three principal mechanisms underlying catastrophic forgetting in transformer LLMs during continual fine-tuning:

  1. Gradient interference localized in attention weights: Destructive interference is quantified by negative cosine similarity between the gradients of the previous task ($g_A = \nabla_\theta \mathcal{L}_A$) and the new task ($g_B = \nabla_\theta \mathcal{L}_B$). Cosine similarity $\cos(g_A, g_B) < 0$ correlates with performance loss on the earlier task. Empirically, conflict is concentrated in query/key attention projections: 67% of these weights show negative gradient alignment vs. 29% in feedforward submodules, with 15–23% of attention heads undergoing severe post-fine-tuning disruption. Disruption rates exceed 28% in lower layers (layers 1–8 in 24-layer models), while upper layers remain more stable (<12% disrupted) (Imanov, 26 Jan 2026).
  2. Representational drift in intermediate transformer layers: After fine-tuning, features in hidden-state space exhibit marked geometric reconfiguration, especially in intermediate layers. Centered kernel alignment (CKA) drops by 0.32–0.47 in these layers post-adaptation, and leading subspace principal components rotate by 35–52°, signalling the reorientation of core feature manifolds. This drift is largely task-agnostic (CKA drop correlates poorly with task relevance; Pearson $r = 0.12$, $p = 0.24$) and reduces the geometric overlap required for robust transfer (Imanov, 26 Jan 2026).
  3. Loss landscape flattening: Sequential learning flattens the local loss basin near prior-task minima, dramatically reducing the top Hessian eigenvalue ($\lambda_1$ falls from $\approx 147$ to $\approx 34$ after three tasks) and raising the loss linearity index ($0.28 \to 0.71$). This flattening impedes restoration of prior weights and precedes measurable behavioral forgetting by 1–2 epochs, serving as a reliable early warning (Imanov, 26 Jan 2026). The link between sharpness and forgetting is corroborated by independent studies, which observe that increased loss-landscape curvature is predictive of susceptibility to CF in LLM tuning (Li et al., 2024).

2. Theoretical Explanations: Gradient Similarity and Neuronal Attribution

A gradient-similarity framework formalizes the conditions for forgetting: let $\mathcal{M}$ index the mastered set (examples already solved by the model), and $\mathcal{I}$ the incoming (injection) set encoding new knowledge. The similarity

$$S(\mathcal{M},\mathcal{I}) = \langle g^M, g^I \rangle = \sum_j g^M_j\, g^I_j$$

directly governs forgetting: if $S < 0$, a gradient step toward $\mathcal{I}$ increases loss on $\mathcal{M}$. Decomposing the inner product per parameter, with $s_j = g^M_j g^I_j$, yields

  • Conflicting neurons ($s_j < 0$): parameter updates that directly harm old knowledge.
  • Collaborative neurons ($s_j \ge 0$): updates that either leave old knowledge intact or actively reinforce it.

Empirical quantification shows that 50–75% of neurons are conflicting at any update—explaining why unrestricted fine-tuning typically erases old capabilities. Freezing conflicting neurons guarantees zero forgetting for infinitesimal learning rates and known master sets (Yang et al., 29 Jan 2026).
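Under the same notation, the conflict mask and the freezing rule can be sketched as follows. This is a toy NumPy illustration of the decomposition, not the authors' implementation, and it applies the rule to a bare parameter vector rather than a full model.

```python
import numpy as np

def conflict_mask(g_master, g_inject):
    """s_j = g^M_j * g^I_j; True marks conflicting parameters (s_j < 0)."""
    return g_master * g_inject < 0

def frozen_step(theta, g_master, g_inject, lr=0.1):
    """Gradient step on the injection loss that leaves conflicting
    parameters untouched, updating only collaborative ones."""
    mask = conflict_mask(g_master, g_inject)
    return theta - np.where(mask, 0.0, lr * g_inject)

# Toy example: parameter 0 conflicts (gradients point in opposite
# directions), parameter 1 collaborates (both gradients agree).
theta = np.zeros(2)
g_m = np.array([1.0, -1.0])
g_i = np.array([-1.0, -1.0])
theta_new = frozen_step(theta, g_m, g_i, lr=0.1)
```

The conflicting coordinate is left exactly at its old value, which is the mechanism behind the zero-forgetting guarantee in the infinitesimal-learning-rate limit.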

3. Quantitative Evidence and Empirical Patterns

Multiple multi-model, multi-benchmark studies reveal robust patterns governing catastrophic forgetting:

  • Model scale effect: Forgetting increases with model size for LLMs (1B–7B). Larger models suffer a larger percentage drop in knowledge, an effect driven by their higher initial performance before fine-tuning (Luo et al., 2023).
  • Task and gradient similarity: Inter-task gradient alignment (Pearson $r = 0.87$) is the strongest predictor of forgetting; more similar tasks induce less interference and thus less forgetting (Imanov, 26 Jan 2026).
  • Architectural dependence: Decoder-only models (e.g., BLOOMZ) are more resistant to CF than encoder-decoder models (e.g., mT0) at fixed scale (Luo et al., 2023).
  • Layerwise vulnerability: Lower transformer layers exhibit more plasticity and greater susceptibility to disruption, consistent with their higher input-gradient magnitudes and fundamental role in early feature formation (Imanov, 26 Jan 2026).

Empirical forgetting rates (instruction tuning, 1B–7B, BLOOMZ models):

| Model | FG (domain knowledge) | FG (reasoning) | FG (reading comprehension) |
|-------|-----------------------|----------------|----------------------------|
| 1.1B  | 9.5%                  | 6.7%           | 18.0%                      |
| 3.0B  | 14.6%                 | 11.1%          | 27.6%                      |
| 7.1B  | 18.4%                 | 13.6%          | 26.8%                      |

FG is the average percent drop on the corresponding benchmark category (domain knowledge, reasoning, or reading comprehension) after instruction tuning (Luo et al., 2023).

4. Mitigation Strategies: Mechanism-Driven and Empirical Approaches

A spectrum of mitigation strategies has been developed, targeting the identified mechanisms:

  • Selective attention head protection: Freezing or regularizing the 20% most at-risk attention heads (as identified by early negative cosine alignment) retains up to 64% of old task performance with <10% penalty to new-task learning (Imanov, 26 Jan 2026).
  • Representational realignment: Layerwise affine realignments (rotations and scalings) on intermediate activations regain ≈38% of lost accuracy, and, when combined with attention freezing, approach 71% retention (Imanov, 26 Jan 2026).
  • Curvature-aware regularization: Penalizing changes to the top $k$ Hessian eigenvalues during fine-tuning reduces forgetting by 34% with only a moderate (12%) convergence slowdown, outperforming L2 regularization and gradient clipping (Imanov, 26 Jan 2026).
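The head-protection heuristic in the first bullet amounts to ranking heads by their old-vs-new gradient alignment and freezing the lowest-scoring fraction. The sketch below assumes a fused query/key projection of shape (n_heads * head_dim, hidden); the shapes and selection code are illustrative, while the 20% fraction follows the text.

```python
import numpy as np

def per_head_cosine(g_old, g_new, head_dim):
    """Cosine alignment between old- and new-task gradients, computed
    separately for each head's slice of a (n_heads*head_dim, hidden) matrix."""
    n_heads = g_old.shape[0] // head_dim
    cosines = []
    for h in range(n_heads):
        a = g_old[h * head_dim:(h + 1) * head_dim].ravel()
        b = g_new[h * head_dim:(h + 1) * head_dim].ravel()
        cosines.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)))
    return cosines

def heads_to_freeze(g_old, g_new, head_dim, frac=0.2):
    """Indices of the `frac` most at-risk heads (most negative alignment)."""
    cos = per_head_cosine(g_old, g_new, head_dim)
    k = max(1, int(round(frac * len(cos))))
    return sorted(range(len(cos)), key=cos.__getitem__)[:k]

# Toy setup: 5 heads of dim 1 over a hidden size of 2; head 3's new-task
# gradient points opposite to its old-task gradient.
g_old = np.ones((5, 2))
g_new = np.ones((5, 2))
g_new[3] = -1.0
frozen = heads_to_freeze(g_old, g_new, head_dim=1, frac=0.2)
```

In practice the returned indices would be used to zero gradients (or set requires_grad to False) for those heads' projection slices during continued fine-tuning.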

Complementary empirical defenses include:

  • Sharpness-Aware Minimization (SAM): Enforcing flat minima during fine-tuning via the SAM objective flattens the loss landscape and largely prevents CF, especially in larger models (7B, 13B). It combines orthogonally with replay and weight averaging (Li et al., 2024).
  • Collaborative Neural Learning (CNL): Freezing all conflicting neurons (identified by gradient product) and updating only collaborative ones achieves zero forgetting in idealized settings, and 59–82% reduction in more practical ones (Yang et al., 29 Jan 2026).
  • Replay buffers and model merging: Experience replay (mixing small fractions of old example data at each step) provides robust retention in both language and speech adaptation (Hsiao et al., 23 May 2025). Model merging (e.g., linear or Slerp interpolation of task vectors) also yields reduced forgetting in cross-lingual or modality transfer (Alexandrov et al., 2024).
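The SAM update in the first bullet is a two-step procedure: ascend to the locally worst-case neighbor within a small L2 ball, then descend using the gradient evaluated there. A generic NumPy sketch, with the loss supplied as a gradient oracle and rho chosen illustratively:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights toward
    the locally worst direction (radius rho), then descend using the
    gradient taken at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent to the sharp neighbor
    return w - lr * grad_fn(w + eps)             # descent with the sharp gradient

# Toy quadratic loss f(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w_next = sam_step(np.array([1.0]), lambda w: w, lr=0.1, rho=0.05)
```

Because the descent gradient is taken at the perturbed point, minima that are sharp (where the perturbed gradient differs strongly) are penalized relative to flat ones.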

5. Broader Manifestations: Instruction Drift, Multilinguality, and Beyond

Forgetting is not confined to standard supervised NLU/NLG tasks:

  • Instruction drift and pseudo forgetting: Degradation of previous-task performance can stem from a loss of effective instruction-to-rationale mapping ("pseudo forgetting"). Simple interventions—such as appending partial rationales or task-agnostic prefixes—often restore prior behavior, indicating underlying knowledge is retained but not properly elicited (Sun et al., 2024).
  • Cross-lingual transfer: Non-Latin scripts are especially vulnerable in sequential multilingual fine-tuning, due to tokenizer and embedding sparsity. Non-shared or script-specific LoRA adapters mitigate the effect by isolating representational updates (Khelli et al., 29 Apr 2025). For translation, instruction-following ability rather than raw model architecture is the primary determinant of forgetting resilience, and PEFT methods such as LoRA do not, in isolation, confer forgetting resistance (Liu et al., 22 Oct 2025).
  • Multimodal and speech adaptation: In spoken language and vision–language pipelines, sequential adaptation across modalities (ASR→TTS→SQA or image captioning→VQA) leads to rapid collapse of original capabilities unless experience replay or sparse parameter updating is deployed (Hsiao et al., 23 May 2025, Hwang et al., 4 Feb 2026).
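Experience replay, which recurs across the language, speech, and vision settings above, amounts to mixing a small sample of stored old-task examples into every fine-tuning batch. A minimal sketch, where the 10% replay fraction is an illustrative default rather than a value from the cited papers:

```python
import random

def replay_batch(new_examples, replay_buffer, replay_frac=0.1, seed=None):
    """Augment a batch of new-task examples with a small sample of
    old-task examples drawn from the replay buffer."""
    rng = random.Random(seed)
    k = min(max(1, int(replay_frac * len(new_examples))), len(replay_buffer))
    return new_examples + rng.sample(replay_buffer, k)

# Toy usage: 10 new examples plus one replayed old example per batch.
batch = replay_batch(list(range(10)), ["old_a", "old_b", "old_c"], seed=0)
```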

6. Open Challenges and Future Perspectives

Despite these advances, critical questions remain. Continual learning for LLMs is still a formidable challenge: aggressive sequential adaptation induces rapid degradation of previously acquired skills unless model architectures and optimizers are explicitly designed for retention or paired with task-aware adaptation protocols. Advances in mechanistic understanding open new avenues toward robust memory retention in trillion-parameter lifelong LLMs (Imanov, 26 Jan 2026, Yang et al., 29 Jan 2026).
