Memorization Dynamics in Knowledge Distillation
- The study quantifies memorization through precise metrics, showing reductions often exceeding 50% compared to standard fine-tuning methods.
- It demonstrates that protocols like soft KD and reverse KL lower memorization risks by smoothing outputs and mitigating high-confidence spikes.
- Feature-based predictability enables the filtering of high-risk examples, guiding adaptive distillation strategies to reduce inherited privacy leaks.
Knowledge distillation (KD) is a widely adopted technique to compress and transfer capabilities from large teacher models to smaller student models. Beyond efficiency and generalization benefits, KD has become a critical tool for mitigating memorization risks associated with training data leakage, especially in privacy-sensitive deployments. The memorization dynamics within KD—including how and which data are memorized, how risk varies across distillation protocols and architectures, and how to predict or mitigate undesired memorization—have recently been characterized with rigorous quantitative methods. Distillation can substantially reduce memorization compared to standard fine-tuning (often by over 50%), concentrate residual memorization on “easy-to-memorize” instances, and demonstrate protocol-dependent differences in privacy risk inheritance (Borkar et al., 21 Jan 2026).
1. Formal Definitions and Quantification of Memorization
Memorization in LLMs is formally defined through the exact reproduction of target data given a prompt, extending both “verbatim memorization” and extractive metrics. The dominant protocol takes a sequence $x$ of length $n$ and splits it into a prefix $p$ (the first $n-k$ tokens) and a suffix $s$ (the last $k$ tokens). A sequence is “memorized” if the model’s greedy continuation of the prefix reproduces the suffix exactly, i.e., $f(p) = s$ (Borkar et al., 21 Jan 2026). In instruction-tuned and translation settings, the memorization fraction over $N$ examples is computed as:

$$\mathrm{MemFrac} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[f(p_i, c_i) = r_i\right],$$

where $(p_i, c_i, r_i)$ denotes prompt, context, and response, and $\mathbb{1}[\cdot]$ is the indicator function (Singh, 19 Jun 2025). Extractive memorization metrics consider exact reproduction after observing only a fraction (e.g., 75%) of input tokens (Dankers et al., 3 Feb 2025). Auxiliary metrics such as ROUGE-N/L scores, perplexity, and entropy (e.g., Zlib entropy in bytes/token) are used to correlate and cluster easily memorized examples (Borkar et al., 21 Jan 2026).
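As a concrete sketch, the verbatim-memorization check and the resulting fraction reduce to an exact token-level comparison between the model's greedy continuation and the reference suffix. The `greedy_continue` callable and the `toy_model` stand-in below are illustrative assumptions, not any paper's actual decoding interface:

```python
from typing import Callable, List, Sequence, Tuple

def is_memorized(
    greedy_continue: Callable[[Sequence[int], int], List[int]],
    prefix: Sequence[int],
    suffix: Sequence[int],
) -> bool:
    """A sequence counts as memorized iff the greedy continuation
    of the prefix reproduces the suffix token-for-token."""
    return list(greedy_continue(prefix, len(suffix))) == list(suffix)

def memorization_fraction(
    greedy_continue: Callable[[Sequence[int], int], List[int]],
    examples: List[Tuple[Sequence[int], Sequence[int]]],
) -> float:
    """Fraction of (prefix, suffix) pairs the model reproduces exactly."""
    hits = sum(is_memorized(greedy_continue, p, s) for p, s in examples)
    return hits / len(examples)

# Hypothetical stand-in model: echoes the prefix's last token repeatedly.
def toy_model(prefix, k):
    return [prefix[-1]] * k

examples = [([1, 2, 3], [3, 3]), ([4, 5], [9, 9])]
print(memorization_fraction(toy_model, examples))  # 0.5
```

In practice `greedy_continue` would wrap a real model's greedy decoding; the check deliberately uses exact token identity, since near-misses fall under softer metrics like ROUGE.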
2. Impact of Distillation Protocols on Memorization
Knowledge distillation protocols are diverse, with substantial effects on memorization rates:
- Soft (logit-level) KD: Minimizes forward KL divergence between teacher and student logits, producing smoothed probability distributions and attenuating forced memorization of rare sequences. For Pythia models, soft KD yields a memorization rate of 0.07% on the FineWeb subset, compared to 0.17% for the baseline and 0.33% for the teacher (a ∼2.4× reduction) (Borkar et al., 21 Jan 2026).
- Hard (sequence-level) KD/Seq-KD: Trains the student on synthetic suffixes generated by the teacher, matched to greedy or beam-decoded sequences. While overall memorization rates are similar (0.07% for both soft and hard KD), hard KD more frequently inherits “teacher-exclusive” memorization (2.7× more such examples than soft KD) (Borkar et al., 21 Jan 2026).
- Reverse KL and JS-divergence KD variants: Using reverse KL divergence (RKLD) for the distillation objective further suppresses high-confidence spikes and thus reduces memorization and membership inference risk compared to classical KL; e.g., memorization fraction drops from 0.523 (SFT) to 0.090 (RKLD) for GPT-2 760M (Singh, 19 Jun 2025, Zhang et al., 9 Aug 2025).
| Protocol | Memorization Fraction (Example) | Inherited Teacher-Exclusive (%) |
|---|---|---|
| Fine-Tuning (SFT) | 0.17–0.52% | High |
| Soft KD (KL) | 0.07–0.47% | 0.9% of teacher-exclusive |
| Hard KD (Seq-KD) | 0.07–0.31% | 2.7× vs. soft KD |
| Reverse KL (RKLD) | 0.06–0.09% | Lower than KL |
Distillation objectives that inject uncertainty (soft logits, JS or reverse KL) systematically lower memorization, while those that emphasize exact sequence reproduction tend to amplify inherited risk, especially for teacher-specific or rare examples (Borkar et al., 21 Jan 2026, Singh, 19 Jun 2025, Zhang et al., 9 Aug 2025).
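The contrast between the forward- and reverse-KL objectives above can be sketched at a single output position with numpy; the example logit vectors and the omission of temperature scaling and the full training loop are simplifying assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_kl(teacher_logits, student_logits):
    """Soft KD objective: KL(teacher || student). Mass-covering, so the
    student is pushed to spread probability over the teacher's full
    (smoothed) distribution."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def reverse_kl(teacher_logits, student_logits):
    """RKLD objective: KL(student || teacher). Mode-seeking, so the
    student is penalized for high-confidence spikes on tokens where
    the teacher itself is uncertain."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(q * (np.log(q) - np.log(p))))

t = np.array([2.0, 0.5, 0.1])    # teacher logits at one position
s = np.array([3.0, -1.0, -1.0])  # over-confident student logits
print(forward_kl(t, s), reverse_kl(t, s))
```

Both divergences vanish when student and teacher logits agree; the reverse direction weights the penalty by the student's own probabilities, which is the mechanism behind its suppression of high-confidence memorized continuations.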
3. Predictability and Concentration of Example-Level Memorization
A striking observation is that memorization during KD is highly concentrated and predictable:
- Easy-to-memorize core: Across architectures and datasets, >95% of memorization by distilled models arises from a small, intrinsic “core” of examples that are also memorized by both teacher and baseline (e.g., for Pythia 1.4B student trained on FineWeb, 676 of 706 memorized instances belong to this set) (Borkar et al., 21 Jan 2026).
- Feature-based predictability: Zlib entropy (compression bytes/token), baseline perplexity, and teacher–baseline KL divergence robustly identify memorized instances. Logistic regression classifiers using these features achieve ROC-AUC ≈ 0.9997 and recall/precision >99% pre-distillation (Borkar et al., 21 Jan 2026).
- Filtering: Strategic removal of high-risk examples flagged by entropy/perplexity prior to distillation virtually eliminates student memorization (e.g., reducing new memorized examples by 99.8%) (Borkar et al., 21 Jan 2026).
| Feature | Dominance in Prediction | Coefficient (Example) |
|---|---|---|
| Zlib Entropy | Highest | –4.50 ± 0.13 |
| KL Divergence | Moderate | –1.06 |
| Perplexity | Moderate | –0.40 (baseline), –0.33 (teacher) |
This suggests that memorization inheritance in distillation can be systematically forecast and mitigated, with strong implications for privacy-aware workflow design.
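The entropy-based flagging step can be sketched with Python's standard `zlib` module: low compressed bytes/token marks repetitive, easy-to-memorize text (consistent with the negative regression coefficient above). The whitespace tokenizer and the 4.0 bytes/token threshold are illustrative assumptions, not values from the studies:

```python
import zlib

def zlib_entropy(text: str) -> float:
    """Compressed size in bytes per token; low values indicate
    repetitive, highly compressible (memorization-prone) text."""
    tokens = text.split()  # whitespace tokenization as a stand-in
    if not tokens:
        return 0.0
    return len(zlib.compress(text.encode("utf-8"))) / len(tokens)

def filter_high_risk(corpus, min_entropy=4.0):
    """Partition a corpus into kept and flagged examples using a
    (hypothetical) Zlib-entropy threshold before distillation."""
    kept, flagged = [], []
    for text in corpus:
        (kept if zlib_entropy(text) >= min_entropy else flagged).append(text)
    return kept, flagged

corpus = [
    "aaaa aaaa aaaa aaaa aaaa aaaa",               # repetitive: low entropy
    "the quick brown fox jumps over the lazy dog", # diverse: higher entropy
]
kept, flagged = filter_high_risk(corpus)
```

A production pipeline would combine this feature with baseline perplexity and teacher–baseline KL divergence in a logistic regression, as in the table above; the single-feature threshold here only illustrates the mechanism.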
4. Inheritance, Amplification, and Privacy Risk
KD is not a privacy panacea; students can still inherit, and in some cases amplify, memorization risks:
- Inheritance of teacher-specific examples: Hard distillation protocols disproportionately transfer teacher-exclusive memorized examples, heightening risk for confidential or rare data (e.g., hard student inherits 50 such examples versus 18 for soft KD) (Borkar et al., 21 Jan 2026).
- Amplification in sequence-level KD: In NMT, students trained via SeqKD not only inherit higher exact-match and extractive memorization rates versus baselines (+3.4% and +57%, respectively), but also show elevated hallucination rates (+13.8% detached, +31% oscillatory) (Dankers et al., 3 Feb 2025).
- Privacy leakage across tasks and objectives: The degree of membership and memorization risk varies widely by KD method, target task, student architecture, and blockwise layer analysis. Reverse KL objectives and reduced student-generated output ratios mitigate leakage, but do not fully eliminate risk (Zhang et al., 9 Aug 2025). Notably, membership inference risk and verbatim memorization risk often disagree: creative/open tasks have high membership leakage but low extractability, whereas closed/Q&A/classification tasks have the opposite pattern.
Block-wise analysis reveals that privacy “hotspots” concentrate in specific transformer layers, differing by architecture, suggesting targeted defense strategies such as layer reinitialization or block-level differential privacy (Zhang et al., 9 Aug 2025).
5. Mitigation Strategies and Practical Guidelines
Several mitigation strategies and workflow recommendations have emerged from empirical and theoretical studies:
- Data-centric filtering: Employ entropy, perplexity, and KL-based metrics to remove high-risk examples before distillation, dramatically shrinking memorization (Borkar et al., 21 Jan 2026).
- Protocol selection: Prefer soft/logit-level KD or reverse KL for lower memorization risk, especially when privacy preservation is paramount. Sequence-level KD should be used with caution and active monitoring (Dankers et al., 3 Feb 2025, Borkar et al., 21 Jan 2026, Zhang et al., 9 Aug 2025).
- Adaptive distillation: Interventions such as Adaptive-SeqKD—finetuning the student on high-quality KD data (e.g., top 20% by COMET-QE)—reduce both extractive memorization and hallucinations (e.g., ExMem decreased by 14%, OscHal by 25–33%) with minimal performance loss (Dankers et al., 3 Feb 2025).
- Capacity and output tailoring: Smaller student models, lower ratios of student-generated outputs, and higher-beam KD configurations generally produce safer, less memorizing models (Zhang et al., 9 Aug 2025).
- Hybrid privacy approaches: Future work advocates for combining distillation with formal privacy methods (e.g., DP-SGD), adding exposure scoring, and routinely auditing both membership and memorization risks (Zhang et al., 9 Aug 2025, Singh, 19 Jun 2025).
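The Adaptive-SeqKD selection step above can be sketched as a simple top-fraction filter over teacher-generated KD pairs; the example strings and scores are hypothetical stand-ins for translation pairs and precomputed COMET-QE values:

```python
def select_top_fraction(examples, scores, fraction=0.2):
    """Adaptive-SeqKD-style selection: keep the top `fraction` of
    KD examples ranked by a quality score (e.g. a precomputed
    COMET-QE value), discarding the rest before fine-tuning."""
    if len(examples) != len(scores):
        raise ValueError("one score per example required")
    k = max(1, int(len(examples) * fraction))
    ranked = sorted(zip(scores, range(len(examples))), reverse=True)
    keep = sorted(idx for _, idx in ranked[:k])  # preserve corpus order
    return [examples[i] for i in keep]

pairs = ["p0", "p1", "p2", "p3", "p4"]   # hypothetical KD pairs
qe = [0.31, 0.88, 0.12, 0.95, 0.47]      # hypothetical quality scores
print(select_top_fraction(pairs, qe, fraction=0.4))  # ['p1', 'p3']
```

The default `fraction=0.2` mirrors the top-20% cut reported for Adaptive-SeqKD; the same filter composes naturally with the entropy/perplexity screening described under data-centric filtering.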
6. Memorization in Knowledge-Augmented Distillation
In knowledge-intensive reasoning tasks, direct memorization by small models is limited by capacity constraints. Knowledge-Augmented Reasoning Distillation (KARD) combines chain-of-thought rationale distillation with external retrieval, theoretically and practically reducing parametric memorization demands (Kang et al., 2023):
- Theoretical bounds: In meta-learning setups, knowledge augmentation with non-parametric retrieval enables the student to make correct predictions while storing substantially fewer memorized bits than would be required without augmentation.
- Empirical effects: External retrieval allows small students (e.g., 250M T5) to outperform much larger fine-tuned models on datasets such as MedQA-USMLE, StrategyQA, and OpenbookQA. Data- and model-size efficiency are both sharply improved, with retention of rare/complex facts supported by retrieval (Kang et al., 2023).
- Practical recommendations: For knowledge-intensive distillation, always pair reasoning transfer with knowledge augmentation; tune retrieval and reranking pipelines to maximize relevant memory externalization.
7. Implications and Future Research Avenues
Current research demonstrates that knowledge distillation systematically, though not universally, reduces training-data memorization and related privacy risks while improving generalization. Residual memorization is strongly concentrated, predictable, and tractable via data-centric filtering and careful protocol selection. Key open directions include:
- Formalizing guarantees for privacy in KD, including synergy with differential privacy techniques.
- Extending analysis to domain-specific and low-resource applications (medical, legal, etc.).
- Blockwise and data-centric adaptive defenses tailored to observed privacy hot spots.
- Continued monitoring of both membership inference and memorization extractability risks, especially as new distillation algorithms and modalities emerge.
A plausible implication is that knowledge distillation, with rigorously deployed filtering, adaptive and knowledge-augmented protocols, and privacy-aware auditing, constitutes a robust foundation for future high-utility, low-leakage model compression in real-world settings (Borkar et al., 21 Jan 2026, Singh, 19 Jun 2025, Zhang et al., 9 Aug 2025, Dankers et al., 3 Feb 2025, Kang et al., 2023).