RL's Razor: Why Online Reinforcement Learning Forgets Less
Abstract: Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with LLMs and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle $\textit{RL's Razor}$: among all ways to solve a new task, RL prefers those closest in KL to the original model.
Explain it Like I'm 14
Overview
This paper asks a simple question with a big impact: when we teach an AI a new skill, why does it sometimes forget what it already knew? The authors compare two common ways to fine-tune big AI models—Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)—and discover a clear pattern: RL tends to forget less. They also uncover a simple rule, which they call “RL’s Razor”: among all the ways to solve a new task, RL naturally prefers solutions that stay closest to the original model’s behavior.
What questions does the paper ask?
The paper focuses on three easy-to-understand questions:
- When we update a model to do a new task, which method makes it forget less—SFT or RL?
- Can we predict how much a model will forget without checking every old task one by one?
- Why does RL seem to preserve old knowledge better than SFT?
How did they study it?
To answer these, the authors ran experiments on both LLMs and robot control models, and also created a simple toy example to test ideas thoroughly.
- LLM tasks included math reasoning, science Q&A, and tool use.
- Robotics involved a pick-and-place task in a simulator.
- They measured two things for each trained model: 1) How well it learned the new task. 2) How much it forgot on a variety of older, unrelated benchmarks (like general knowledge, logic, and coding).
They trained many versions of each model with different settings to see the best possible trade-offs between learning new stuff and keeping old skills (this set of “best trade-offs” is called a Pareto frontier).
They also built a simple toy setup called ParityMNIST, where an answer counts as correct if the model outputs any even digit for an image of an even digit and any odd digit for an image of an odd digit. This matters because, like many real-world tasks, there are many different ways to give a correct answer. This toy setting let them run clean, repeatable tests and probe their ideas more precisely.
Key ideas explained in everyday language
- Supervised Fine-Tuning (SFT): The model learns by copying answers from labeled examples. Think of it like studying by repeatedly looking at a teacher’s answer key and trying to match it exactly.
- Reinforcement Learning (RL): The model tries answers, gets a reward if it’s right, and adjusts itself to get more rewards next time. Think of it like practicing a sport: you try, see if you scored, and tweak your moves—but you’re mostly building on what you already do.
- On-policy vs. offline: On-policy methods (like RL here) learn from the model’s own attempts. Offline methods (like SFT) learn from a fixed set of examples created by others. On-policy learning keeps you close to what you already do; offline learning can yank you toward a new style that may not match your old habits.
- KL divergence: A measure of “how much your behavior changed.” Imagine your favorite ice cream flavors: before training, maybe you choose vanilla 70%, chocolate 20%, strawberry 10%. After training, if these percentages barely change, KL divergence is small. If they change a lot, KL is big. In this paper, smaller KL means the new model acts more like the old one.
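The ice-cream analogy can be made concrete with a few lines of code. This is a minimal sketch of the standard discrete KL-divergence formula; the flavor percentages are the illustrative numbers from the analogy, not data from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base       = [0.70, 0.20, 0.10]  # vanilla, chocolate, strawberry before training
small_move = [0.65, 0.23, 0.12]  # behavior barely changed -> small KL
big_move   = [0.10, 0.10, 0.80]  # behavior changed a lot  -> large KL

print(kl_divergence(base, small_move))  # ~0.006 nats
print(kl_divergence(base, big_move))    # ~1.29 nats
```

Identical distributions give exactly zero; the farther the new behavior drifts from the old, the larger the number.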
What did they find?
Here are the main results:
- RL forgets less than SFT at the same new-task performance. In other words, if both methods reach 90% on the new task, the RL-trained model usually keeps more of its old skills than the SFT-trained model.
- “Forgetting follows KL”: How much a model forgets can be predicted by a single number—the KL divergence between the new model and the original model, measured on the new task. Bigger KL means more forgetting.
- Why RL helps: RL is “on-policy,” meaning it learns from its own outputs. That naturally keeps updates small and closer to the original model (small KL). SFT can pull the model toward whatever labels it’s given, even if that makes it drift far from how it used to behave (large KL).
- It’s not about “negative examples” alone: The authors tested several algorithms and found that the key factor is whether the method is on-policy (learning from your own tries), not whether it uses negative feedback.
- An “oracle SFT” proves the point: When they carefully designed SFT labels to minimize KL while still being perfectly correct, SFT forgot even less than RL. This shows the secret sauce is staying close in KL, not the RL algorithm itself.
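The "forgetting follows KL" predictor is just an average of token-level KL values over new-task inputs. A minimal sketch, assuming you have already run both the base and fine-tuned models over task prompts and extracted per-token next-token distributions (the `(base_dist, tuned_dist)` pairs below are hypothetical placeholders for those model outputs):

```python
import math

def token_kl(p_base, p_tuned):
    """Forward KL(pi0 || pi) for one next-token distribution, in nats."""
    return sum(p * math.log(p / q) for p, q in zip(p_base, p_tuned) if p > 0)

def forward_kl_on_task(examples):
    """Average forward KL over all token positions of all new-task prompts.

    `examples` is a list of sequences; each sequence is a list of
    (base_dist, tuned_dist) pairs, one per token position -- a stand-in
    for running both models on prompts drawn from the new task.
    """
    kls = [token_kl(b, t) for seq in examples for (b, t) in seq]
    return sum(kls) / len(kls)

# Toy check: a model that barely moved has a near-zero estimate.
seq = [([0.9, 0.1], [0.88, 0.12]), ([0.5, 0.5], [0.52, 0.48])]
print(forward_kl_on_task([seq]))
```

A larger value of this single number predicts more forgetting on old tasks, without evaluating any old-task benchmark.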
Why is this important?
- Practical prediction: You don’t need old-task data to guess whether you’re causing forgetting. Just measure how much your model’s behavior changes (KL) on the new task. If KL is small, you’re probably safe.
- Safer fine-tuning: If you want models that learn new things without losing old abilities (like chatbots that learn a new tool without getting worse at reasoning), aim to minimize KL change during training.
- Better design choices: Use on-policy approaches (like RL) or guide SFT to prefer solutions that stay close to the original model. This helps build “long-lived” AI systems that keep improving over time without wiping their previous knowledge.
Final takeaway and impact
The paper’s simple rule—RL’s Razor—says that among all the ways to solve a new task, the best are the ones that change the model the least (small KL). RL naturally leans in that direction because it learns from its own behavior. This insight helps researchers and engineers design training methods that let AI models grow new skills while protecting old ones, pushing us closer to reliable, continually learning AI assistants.
Knowledge Gaps, Limitations, and Open Questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.
- Empirical scope is narrow: results are shown for a single mid-sized LLM family (Qwen 2.5 3B) and one robotic foundation model (OpenVLA 7B) on a few tasks; it is unknown whether the KL–forgetting law and RL’s Razor hold for frontier-scale models, multilingual/vision-LLMs, long-context tasks, code, dialog safety, or real-world robotics.
- Real-world robotics not tested: findings are validated in simulation; sim-to-real transfer and physical deployment constraints (latency, safety, nonstationarity) are unaddressed.
- Sequential continual learning: the study considers adaptation to one new task; it is unknown how KL accumulation behaves over long sequences of tasks, whether the “KL budget” composes additively, and how retention scales across many updates.
- Off-policy RL is not studied: the paper explicitly omits off-policy algorithms; it is unclear whether methods with importance sampling, behavior cloning regularizers, or replay buffers maintain the same KL-minimal bias and forgetting behavior.
- Algorithm coverage is limited: only GRPO, 1–0 REINFORCE, SFT, and SimPO are compared; behavior under PPO/TRPO, AWR/AWAC, online DPO/KTO variants, or conservative policy iteration remains unknown.
- Role of explicit KL regularization in RL: experiments use RL without explicit KL penalties; the interaction between implicit on-policy KL minimization and explicit KL regularizers (coefficient choice, scheduling, trust-region radius) is unquantified.
- Mechanistic cause of forgetting remains unclear: why larger forward KL on the new-task distribution causes prior-task degradation (e.g., representational interference, capacity constraints, layer-wise drift) is not mechanistically established.
- Correlation vs causation: the work demonstrates strong correlations between forward KL and forgetting, but does not perform interventions that hold new-task performance constant while directly manipulating KL to establish causal effects in large models.
- Estimating KL at scale: the paper uses approximate forward KL measured on the new-task distribution; the bias/variance of the estimator, sensitivity to prompt distribution, sequence length, decoding temperature, and token-level vs sequence-level KL are not analyzed.
- Which divergence matters when: forward KL outperforms reverse KL/TV in their settings, but conditions under which reverse KL or other f-divergences better predict forgetting are not characterized.
- New-task distribution dependence: the predictor relies on the new-task input distribution $\tau$; it is unknown how sensitive conclusions are to coverage/shift of $\tau$, dataset curation, or nonstationary new-task distributions.
- Oracle SFT in practice: while an oracle KL-minimal SFT distribution beats RL in the toy setting, the paper does not provide practical methods to approximate or collect such oracle distributions for complex LLM tasks.
- On-policy supervised data collection: beyond 1–0 REINFORCE, it is unclear whether DAgger-style on-policy relabeling or active data selection for SFT can replicate RL’s KL-minimal path and retention at scale.
- Negative examples conclusion may be underdetermined: SimPO uses externally sampled negatives; it remains open whether on-policy negatives with offline updates, or calibrated preference data, would meaningfully affect KL and forgetting.
- Effect of exploration and entropy bonuses: how entropy regularization, temperature schedules, or diversified sampling impact KL movement and forgetting is not explored.
- Reward design beyond binary success: the study uses binary rewards; it is unknown whether dense/graded rewards, step-level credit assignment, or preference-based rewards change the KL–forgetting relationship.
- Hyperparameter confounds: while many configurations are tried, a controlled analysis of optimizer choice, learning-rate schedules, batch size, and early stopping on KL/retention trade-offs is missing.
- Numeric precision effects: bfloat16-induced apparent sparsity is identified, but the broader impact of numeric precision and quantization on update magnitudes, KL movement, and forgetting is not systematically studied.
- Capability-locality of forgetting: prior-task evaluation aggregates diverse benchmarks; it is unclear which capability clusters (e.g., commonsense, factual recall, coding) are most sensitive to KL shifts or which layers/submodules mediate the effect.
- Function class and nonconvexity gaps: theory assumes finite Y, convex policy families, and binary rewards; extensions to nonconvex neural nets, autoregressive sequence models, continuous actions, partial-credit rewards, and practical optimization noise are unproven.
- Autoregressive structure: the paper does not analyze how per-token vs per-sequence KL, exposure bias, or credit assignment in long sequences affect the KL–forgetting law.
- Safety and alignment implications: minimizing forward KL preserves prior behavior, which may include harmful biases; how to trade off retention against necessary distributional shifts for safety/alignment is not addressed.
- Compute/sample efficiency: RL is typically more expensive; the paper does not quantify compute/sample cost per unit of new-task gain at fixed forgetting, or whether KL-constrained SFT can match RL’s retention more efficiently.
- Task multiplicity assumption: RL’s Razor relies on multiple equally good solutions; for tasks with near-unique targets, does the advantage persist, and how can solution multiplicity be measured or induced?
- Continual pretraining vs post-training: interactions between pretraining strategies (e.g., continual pretraining) and post-training KL control are not investigated.
- Practical KL control mechanisms: beyond the high-level prescription to minimize forward KL, concrete, scalable training recipes (e.g., mirror descent with forward-KL trust regions, constrained decoding, base-distribution reweighting) are not developed or tested.
- Robustness to data overlap/contamination: the impact of overlap between new-task data and prior benchmarks on measured forgetting and KL is not examined.
- Generalization of the quadratic fit: the approximate quadratic relationship between forward KL and forgetting is empirical; its functional form, scaling with model size, and theoretical derivation remain open.
- Capability recovery strategies: the paper does not test post-hoc recovery (e.g., small replay, distillation from base, LoRA merges) under matched KL to assess whether forgetting can be reversed without violating the KL constraint.
- Public resources and reproducibility: detailed KL estimation code, on-policy data collection pipelines, and full hyperparameter sweeps are not described in sufficient detail to assess reproducibility across labs and hardware.
Practical Applications
Immediate Applications
The following applications can be deployed now by integrating on-policy RL and KL-aware procedures into existing post-training and MLOps workflows. Each item notes sectors, concrete tools/workflows, and feasibility assumptions.
- **KL-budgeted post-training for new features without forgetting**
- Sectors: software, consumer AI, enterprise AI
- What: Prefer on-policy RL (e.g., GRPO, simple 1–0 REINFORCE) over SFT when adding capabilities to LLMs or VLMs to maintain prior skills at matched new-task performance. Track forward KL on new-task prompts during training and early-stop when a KL budget is exceeded.
- Tools/workflows: on-policy sampling trainer, forward-KL monitor $\mathbb{E}_{x\sim\tau}[\mathrm{KL}(\pi_0\,\|\,\pi)]$, "KL budget" early stopping
- Assumptions/dependencies: access to base model logits/tokenizer; representative prompt set for the new task; RL training stability and compute budget
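The "KL budget" early-stopping rule above is simple to operationalize. A minimal sketch (the `patience` smoothing is an illustrative choice, not a recipe from the paper): at each checkpoint, estimate forward KL on held-out new-task prompts, and stop once the estimate stays over budget.

```python
def kl_budget_stop(kl_history, budget, patience=3):
    """Return True once the forward-KL estimate has exceeded the budget
    for `patience` consecutive checkpoints.

    kl_history: per-checkpoint estimates of E_{x~tau}[KL(pi0 || pi)].
    budget: maximum tolerated drift from the base model, in nats.
    """
    if len(kl_history) < patience:
        return False
    return all(kl > budget for kl in kl_history[-patience:])

# Example: stop after three consecutive checkpoints above a 0.05-nat budget.
history = [0.01, 0.03, 0.06, 0.07, 0.08]
print(kl_budget_stop(history, budget=0.05))  # True
```

Requiring several consecutive breaches avoids halting on a single noisy KL estimate.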
- **MLOps governance with KL drift gates**
- Sectors: software, platform/MLOps, regulated industries
- What: Integrate forward-KL drift as a release gate for post-training updates. Treat $\mathbb{E}_{x\sim\tau}[\mathrm{KL}(\pi_0\,\|\,\pi)]$ as a proxy for forgetting when past-task data is unavailable.
- Tools/workflows: CI/CD hooks to compute forward KL on holdout new-task prompts; dashboards; alerting and rollback on KL breaches
- Assumptions/dependencies: retained base snapshot; standardized KL estimation protocol; agreed KL thresholds per product risk profile
- **KL-aware SFT data curation**
- Sectors: data labeling, education content, customer-support bots
- What: When multiple correct outputs exist, choose/author labels closest in distribution to the base model’s outputs to reduce forgetting (e.g., select paraphrases consistent with base style, formatting, and reasoning steps).
- Tools/workflows: base-model candidate generation; annotator UI that surfaces base outputs; selection heuristics minimizing token-level forward KL
- Assumptions/dependencies: ability to generate base outputs; clear correctness criteria; may increase labeling effort
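One lightweight selection heuristic: among equally correct candidate labels, keep the one the base model already finds most likely, since maximizing base-model likelihood of the chosen label is a practical proxy for minimizing the forward-KL pull of SFT. A sketch, where `base_logprob` is a hypothetical scoring interface (in practice, score each candidate with the frozen base checkpoint):

```python
def pick_kl_minimal_label(candidates, base_logprob):
    """Among equally correct candidate labels, pick the one the base
    model already finds most likely -- a proxy for a KL-minimal label.

    base_logprob(text) -> total log-probability of `text` under the
    frozen base model (hypothetical interface).
    """
    return max(candidates, key=base_logprob)

# Toy stand-in scorer: the base model happens to prefer the plain phrasing.
toy_scores = {"The answer is 4.": -2.0, "Thus, via careful deduction, 4.": -9.5}
best = pick_kl_minimal_label(list(toy_scores), toy_scores.get)
print(best)  # "The answer is 4."
```

This mirrors the paper's "oracle SFT" finding at a practical level: correctness is held fixed, and only closeness to the base distribution varies.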
- **Reward-light RLHF via binary success signals**
- Sectors: software, agents, tool-use automation
- What: Replace complex reward models with binary success/failure indicators (as used in the paper) to drive on-policy RL while retaining prior capabilities.
- Tools/workflows: programmatic checks for tool success, unit-test-style harnesses, pass/fail rubric for reasoning tasks
- Assumptions/dependencies: well-defined success criteria; coverage across tasks; limited reward hacking risk
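The binary-reward, on-policy recipe can be illustrated with a toy 1–0 REINFORCE loop on a single softmax policy over discrete answers (a deliberately simplified sketch, not the paper's GRPO setup; the learning rate and step count are illustrative). Because samples come from the current policy, probability mass shifts among answers the model already produces rather than being dragged toward an arbitrary external target:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_binary(logits, is_correct, lr=0.5, steps=200, seed=0):
    """1-0 REINFORCE on a softmax policy over discrete answers.

    is_correct[a] = True if answer `a` earns reward 1, else reward 0.
    """
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one answer from the current policy (on-policy).
        a = rng.choices(range(len(probs)), weights=probs)[0]
        r = 1.0 if is_correct[a] else 0.0
        # grad of log pi(a) wrt logits = one_hot(a) - probs
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)

# Three answers; answers 0 and 2 are both correct (many valid solutions).
final = reinforce_binary([2.0, 0.0, 0.0], [True, False, True])
print(final)  # mass concentrates on the rewarded answers
```

Note that the initially favored correct answer keeps most of the mass: the update reinforces what the policy already does, which is the KL-minimal tendency the paper highlights.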
- **Continual personalization with KL safety rails**
- Sectors: consumer assistants, customer support, productivity tools
- What: Adapt models to user/org preferences using on-policy feedback (thumbs up/down, accept/overwrite), enforcing a per-user KL budget to avoid overfitting and global skill loss.
- Tools/workflows: per-tenant LoRA heads trained on-policy; KL per-tenant monitoring; safe rollout/rollback
- Assumptions/dependencies: storage for per-tenant baselines; privacy and consent for learning from interactions
- **RL-to-SFT distillation for cost-effective deployment**
- Sectors: software, edge AI
- What: Train with on-policy RL to reach a KL-minimal solution, then distill the learned distribution into an SFT model to reduce serving cost while retaining the low-forgetting behavior.
- Tools/workflows: self-play data generation; teacher-student distillation; regression to teacher logits or samples
- Assumptions/dependencies: teacher access; sufficient synthetic data; careful temperature control to preserve distribution
- **Robotics skill adaptation without erasing prior skills**
- Sectors: robotics, manufacturing, logistics, home robotics
- What: Use on-policy RL to teach new tasks (e.g., novel pick-and-place variants) while tracking forward KL versus the base robotic policy to preserve a library of existing skills.
- Tools/workflows: sim-first on-policy data collection; KL monitoring on task observations; safety shields for real-world trials
- Assumptions/dependencies: safe on-policy experience collection; sim-to-real alignment; sensor/action compatibility across tasks
- **Forgetting risk scoring when prior-task data is unavailable**
- Sectors: academia, industry R&D
- What: Use $\mathbb{E}_{x\sim\tau}[\mathrm{KL}(\pi_0\,\|\,\pi)]$ as a practical, task-local predictor of forgetting to prioritize experiments, set early stopping, and compare methods at matched new-task accuracy.
- Tools/workflows: standardized prompt suites per new task; Pareto frontier plotting of new-task accuracy vs forward KL vs prior-skill proxy benchmarks (when available)
- Assumptions/dependencies: empirical relationship holds best for generative tasks with multiple valid solutions; calibration needed by domain
- **Federated and edge personalization with KL anchoring**
- Sectors: mobile, IoT, enterprise devices
- What: Run small on-policy updates locally and enforce a KL cap relative to a global base to prevent catastrophic drift while achieving personalization.
- Tools/workflows: local RL w/ binary reward; periodic KL audits against cached base logits; server-side policy review
- Assumptions/dependencies: device compute; secure storage of base reference; privacy-preserving telemetry of KL metrics
Long-Term Applications
These applications require additional research, standardization, or scaling to diverse domains, but are directly motivated by the paper’s KL–forgetting law and the RL’s Razor principle.
- **Constrained training algorithms that explicitly implement RL's Razor**
- Sectors: software, robotics, academia
- What: New optimizers that minimize forward KL to the base subject to achieving a target reward/accuracy (solve: $\min_\pi \mathbb{E}[\mathrm{KL}(\pi_0\,\|\,\pi)]$ s.t. $\mathbb{E}[R] \ge r^*$), generalizing policy gradient behavior.
- Tools/products: trust-region–like solvers for forward KL; differentiable constraint handling; open-source libraries
- Assumptions/dependencies: efficient estimators; stability for large models; compatibility with PEFT
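The constrained objective above can be written out explicitly, along with its Lagrangian relaxation, the standard form that trust-region-style solvers start from (a sketch under ordinary Lagrangian-duality assumptions, not a derivation from the paper):

```latex
% Forward-KL-constrained fine-tuning objective
\min_{\pi}\; \mathbb{E}_{x\sim\tau}\!\left[\mathrm{KL}\!\left(\pi_0(\cdot\mid x)\,\middle\|\,\pi(\cdot\mid x)\right)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{x\sim\tau,\; y\sim\pi(\cdot\mid x)}\!\left[R(x,y)\right] \;\ge\; r^{*}

% Equivalent Lagrangian saddle-point problem, with multiplier \lambda \ge 0
\max_{\lambda \ge 0}\;\min_{\pi}\;
\mathbb{E}_{x\sim\tau}\!\left[\mathrm{KL}(\pi_0 \,\|\, \pi)\right]
\;-\; \lambda \left( \mathbb{E}_{x\sim\tau,\; y\sim\pi}\!\left[R(x,y)\right] - r^{*} \right)
```

For a fixed $\lambda$, the inner problem trades off staying close to the base model against earning reward, which is exactly the new-task-gain vs. KL-drift trade-off the paper's Pareto frontiers measure.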
- **Automated "oracle SFT" labelers**
- Sectors: data platforms, education, enterprise QA
- What: Systems that automatically select or rewrite labels to be correct and closest in forward KL to the base model (approximating the oracle SFT distribution shown to forget less than RL).
- Tools/products: base-model proposal generation; correctness validators; KL-based reranking; active learning
- Assumptions/dependencies: reliable correctness checks; scalable reranking; risk of style lock-in if overconstrained
- **KL-drift reporting standards for AI governance**
- Sectors: policy/regulation, high-risk AI (healthcare, finance, critical infrastructure)
- What: Require reporting of forward KL drift from base when post-training models used in safety-critical contexts are updated, with context-dependent KL budgets and audit trails.
- Tools/products: compliance APIs; third-party KL audits; certification programs
- Assumptions/dependencies: consensus on measurement protocols; cross-vendor comparability; legal frameworks
- **Continual-learning "operating systems" for long-lived agents**
- Sectors: consumer agents, enterprise copilots, research agents
- What: Always-on on-policy learning with dynamic KL budgets per capability, automatic early stopping, and rollback; multi-objective controllers trading new-task reward vs KL vs safety constraints.
- Tools/products: agent training control planes; KL-aware schedulers; capability-specific baselines
- Assumptions/dependencies: reliable online reward signals; robust drift detection; cost management
- **Fleet-scale robotic adaptation under KL budgets**
- Sectors: logistics, manufacturing, agriculture, service robotics
- What: Deploy a shared base policy across a fleet and allow local on-policy updates within KL limits so units adapt to local variation without losing shared skills.
- Tools/products: fleet orchestration with KL policy constraints; shared skill libraries; cross-site evaluation
- Assumptions/dependencies: safe exploration; heterogeneity in hardware; real-time monitoring infrastructure
- **Enterprise multi-tenant model hosting with safe per-tenant adaptation**
- Sectors: SaaS, customer support, legal/finance advisory
- What: Provide per-tenant on-policy adaptation with strict KL budgets to base and reversible snapshots so global model quality and safety baselines are preserved.
- Tools/products: tenant isolation; versioned base snapshots; KL-gated deployment; audit logs
- Assumptions/dependencies: storage/compute cost; access control; SLAs for drift incidents
- **Lifelong tutoring systems that adapt without losing curriculum coverage**
- Sectors: education
- What: Student-specific adaptation via on-policy feedback (problem attempts, hints accepted) with KL budgets to preserve general pedagogical competence and standards alignment.
- Tools/products: adaptive curricula with correctness signals; KL-aware personalization controllers
- Assumptions/dependencies: robust scoring of learning gains; fairness and privacy safeguards
- **Clinical decision support tuned to local protocols without eroding general knowledge**
- Sectors: healthcare
- What: On-policy RL with binary protocol adherence signals and tight KL budgets to the base (evidence-based medicine) to localize to hospital workflows.
- Tools/products: clinical simulators/checklists; KL governance panels; post-update validation suites
- Assumptions/dependencies: regulatory approval; rigorous validation; bias and safety monitoring
- **Compliance-centered assistants that adapt to firm policies yet retain conservative defaults**
- Sectors: finance, legal, risk
- What: On-policy RL from compliance officer feedback with KL caps to base safety behaviors; automatic forgetting risk alerts from KL spikes.
- Tools/products: review consoles; KL-based risk scoring; change management workflows
- Assumptions/dependencies: curated policy corpora; human-in-the-loop oversight; auditability
- **Auto-early stopping and rollback driven by forgetting forecasts**
- Sectors: MLOps, platform
- What: Predict catastrophic forgetting from real-time forward KL growth on new-task data and pause/rollback training automatically before prior-task regressions manifest.
- Tools/products: online KL estimators; forecast models; policy engines for training control
- Assumptions/dependencies: stable KL–forgetting correlation in the target domain; low-latency training control hooks
Cross-cutting assumptions and dependencies
- On-policy sampling access: Training pipelines must sample from the current model and retain a compatible base model for KL reference (same tokenizer/logit space).
- KL estimation quality: Forward KL is estimated on a representative new-task input distribution; mis-specified prompt sets can distort forgetting risk.
- Compute and stability: On-policy RL typically costs more than SFT; binary rewards and lightweight policy-gradient variants can mitigate overhead but still require careful tuning.
- Domain limits: Empirical law validated on moderate-scale LLMs and robotic models; behavior at frontier scales and in other modalities may require additional evidence and calibration.
- Safety and oversight: Especially in robotics and safety-critical sectors, on-policy data collection and updates need guardrails, simulators, and human review.