Evidence Forgetting Rate

Updated 19 January 2026

Evidence Forgetting Rate is a quantitative measure that captures how the influence of learned information diminishes over time in both human cognition and machine learning systems.
Mathematical models such as exponential decay, power-law decay, and convolutional approaches rigorously capture this rate, enabling measurement through recall probability, attack success, and loss metrics.
Optimizing forgetting rate can enhance memory recall, balance stability with plasticity, and improve privacy and adaptability in continual learning frameworks.

Evidence Forgetting Rate is a quantitatively defined parameter that captures the rate at which previously acquired, observed, or encoded information loses influence, salience, or retrievability within a given system. The term arises across disciplines—from cognitive psychology, where it describes human memory decay, to machine learning, where it governs data retention in streams, determines neural-network recall over time, and structures the dynamics of continual and LLM learning. Theoretical and empirical approaches rigorously model forgetting rates using exponential, power-law, and convolutional decay forms, and operationalize measurement through both accuracy and privacy-motivated audit metrics.

1. Mathematical Formalizations of Forgetting Rate

Forgetting rate is instantiated mathematically by specifying a functional decay law for retention or influence of evidence over time or sequence steps.

Exponential Decay: In both human and machine memory frameworks, the retention of an item, $R(t)$ , often follows $R(t) = e^{-\lambda t}$ , where $\lambda$ is the evidence forgetting rate (Tran et al., 28 Dec 2025, Yu et al., 2018).
Power-Law Decay: Empirical and classic cognitive studies frequently fit $R(t) = A t^{-\beta}$ , where $\beta$ is the decay exponent. This form captures the typical “fast-then-plateau” characteristic of human memory (Kline, 22 May 2025, Yu et al., 2018).
Convolutional Models: For repeated learning or spaced rehearsal, forgetting is modeled as the superposition (convolution) of impulse response kernels, each with their own decay rate, $a_2$ , yielding $y(t) = \sum_{n=1}^N a_1 e^{-a_2 (t-T_n)} + a_3$ ; instantaneous forgetting rate is $r(t) = \frac{a_2 a_1 \sum_n e^{-a_2 (t-T_n)}}{\sum_n (a_1 e^{-a_2 (t-T_n)} + a_3)}$ (Xie et al., 2019).
Markovian Absorption (Collective Forgetting): In social or cultural-semantic memory, forgetting is parameterized as the mean absorption rate $\lambda = \frac{q (p + r)}{r + q}$ , where $p,q,r$ are rates of transfer/decay between communicative memory, cultural memory, and oblivion (Candia et al., 2020).
Resource-Constrained Bayesian Updating: LLM in-context memory can be expressed as a discounted Bayesian update,

$R(t) = e^{-\lambda t}$ 0

where $R(t) = e^{-\lambda t}$ 1 (discount factor), or equivalent token-weighting $R(t) = e^{-\lambda t}$ 2 (Tran et al., 28 Dec 2025).

2. Measurement in Machines: Empirical and Theoretical Metrics

Operationalizing evidence forgetting rate requires extracting decay parameters from observed trajectories, audit tests, or performance metrics.

Recall Probability: In neural networks, recall is measured as the probability that the network’s hidden state aligns with a target prototype, tracking this probability over time/epochs since last exposure (Kline, 22 May 2025).
Attack Success Rate: In privacy studies, the metric is the success probability of a membership-inference or canary-extraction attack after removal of a sensitive datum; the forgetting rate is the drop in attack success per step (Jagielski et al., 2022).
Cross-Entropy Loss Relative to Base Model: Following fine-tuning, forgetting can be quantified as $R(t) = e^{-\lambda t}$ 3, the cross-entropy from pre-trained to fine-tuned model prediction distributions (Kalajdzievski, 2024).
Task Error or Loss Curves: In continual learning frameworks, the mean squared error on prior tasks (or classification error) as a function of new task iterations directly measures rate of forgetting (Mahdaviyeh et al., 4 Jun 2025).
Information Attention Decay: In collective memory, normalized citation attention ( $R(t) = e^{-\lambda t}$ 4) fit to an exponentially-mixed model, with $R(t) = e^{-\lambda t}$ 5 estimated via ODE parameter regression (Candia et al., 2020).

3. Psychological, Sociotechnical, and Algorithmic Drivers

Forgetting rate exhibits deep connections to cognitive constraints, cultural information load, task drift, and learning-systems architecture.

Information Volume Pressure: In communities (e.g., physics, invention), the forgetting rate $R(t) = e^{-\lambda t}$ 6 of works doubles over multi-decade intervals as collective output volume rises, validating the “forgetting as annulment” effect (Candia et al., 2020).
Stability–Plasticity Trade-Off: In LLM and cognitive modeling, rate $R(t) = e^{-\lambda t}$ 7 governs the balance between retaining old evidence (low $R(t) = e^{-\lambda t}$ 8), and rapid adaptation to new data (high $R(t) = e^{-\lambda t}$ 9) (Tran et al., 28 Dec 2025).
Concept Drift Adaptation: In online learning with non-stationary distributions, optimal performance requires tuning forgetting rate proportional to detected drift magnitude—a central tenet of the “sweet path” hypothesis (Zaidi et al., 2018).
Replay and Interference: Intensive replay or review reduces effective forgetting rates and can produce nonlinear (non-monotonic) effects, with small or misaligned replay sometimes increasing global forgetting via geometric task interference (Mahdaviyeh et al., 4 Jun 2025, Xie et al., 2019).
Randomness vs. Determinism: Stochasticity in training induces $\lambda$ 0 decay (“forgetting law”) in audit attack success; deterministic optimization can preclude forgetting entirely for specific memorized points (Jagielski et al., 2022).

4. Empirical Scaling Laws and Quantitative Values

Several studies provide fitted values for forgetting-rate parameters and scaling relationships.

LLMs: Empirical optimal forgetting rates $\lambda$ 1 span roughly $\lambda$ 2 to $\lambda$ 3 per token, corresponding to “half-lives” from dozens to hundreds of tokens; optimal Bayesian discount factors $\lambda$ 4 (Tran et al., 28 Dec 2025).
Power-Law Decay in MLPs: Single-task MLP classification without review fits a power curve $\lambda$ 5 ( $\lambda$ 6), with half-life $\lambda$ 7 epochs (Kline, 22 May 2025).
Collective Memory: APS papers (1950–1999) show $\lambda$ 8 increases from $\lambda$ 9 to $R(t) = A t^{-\beta}$ 0; USPTO patent forgetting rate rises from $R(t) = A t^{-\beta}$ 1 to $R(t) = A t^{-\beta}$ 2 (Candia et al., 2020).
Fine-Tuning Scaling Laws: Forgetting loss obeys $R(t) = A t^{-\beta}$ 3, with fitted exponents and strong $R(t) = A t^{-\beta}$ 4 on benchmark LLMs (Kalajdzievski, 2024).
Ebbinghaus-style Human Curves: Human and MLP memory decay exponents $R(t) = A t^{-\beta}$ 5 consistently emerge, with overlearning and spaced repetition flattening the decay (Kline, 22 May 2025, Yu et al., 2018).

5. Practical Implications and Interventions

Strategic manipulation of the forgetting rate has significant consequences for learning efficiency, memory safety, and system adaptability.

Review Scheduling: Spaced or intensive review resets, implemented according to the monitored forgetting-curve, robustly flatten decay and increase recall even past initial capacity—mirroring human overlearning (Xie et al., 2019, Kline, 22 May 2025).
Replay in LLM Pre-Training: Very low-cost focused stochastic replay can reduce entity forgetting rate by $R(t) = A t^{-\beta}$ 640–50% (relative on extraction metrics), yielding persistent improvements even in vanilla models and boosting zero-shot accuracy by $R(t) = A t^{-\beta}$ 7 points (Liao et al., 2024).
Optimal Matching to Drift: Online and incremental learners perform best when forgetting rate is adaptively matched to measured drift rates, following the sweet path: higher drift $R(t) = A t^{-\beta}$ 8 higher forgetting, lower model variance (Zaidi et al., 2018).
Cultural Selectivity: Knowledge communities adapt to rising forgetting rates by increasing selectivity at cultural-memory transfer ( $R(t) = A t^{-\beta}$ 9), preferentially “buffering” the most valuable artifacts into durable memory reservoirs (Candia et al., 2020).
Privacy and Unlearning: Natural forgetting induced by SGD randomness offers passive privacy amplification; privacy-preserving mechanisms benefit by focusing on “fresh” rather than long-trained data (Jagielski et al., 2022).

6. Limitations, Nuances, and Open Questions

Forgetting rate is not universally beneficial or monotonic in its effects; system structure and process details produce complex dependencies.

Non-monotonicity in Replay: Sample replay can worsen forgetting, with effect size and direction tightly linked to choice of samples and mutual geometry of task subspaces; monotonic benefit emerges only with sufficiently large or well-aligned replay (Mahdaviyeh et al., 4 Jun 2025).
Metric Sensitivity and Misestimation: Popular metrics like PPL and binary memorization scores fail to detect substantial fact-level forgetting, urging the adoption of entity-focused metrics in LLM assessments (Liao et al., 2024).
Interference-Limited Capacity: In human models, the noise-driven retention boundary naturally recovers Miller's “seven plus or minus two” item short-term memory constraint (Yu et al., 2018).
Non-convexity and Determinism: In specific non-convex systems or with fully deterministic training, memorized evidence may be indefinitely persistent regardless of elapsed steps (Jagielski et al., 2022).

7. Cross-Domain Convergence and Theoretical Synthesis

Across domains, evidence forgetting rate emerges as a central parameter structuring adaptive memory processes, whether in human cognition, artificial networks, knowledge communities, or large-scale LLMs. The trade-off between stability and adaptability, operationalized via exponential or power-law decay, is reflected in both the quantitative dynamics of learning systems and their practical regulatory mechanisms for memory management. Decades of work in both human and artificial domains converge on the finding that forgetting is not merely a deficit but a principled mechanism for efficiency, adaptability, and safe, scalable information integration.