
FIRE: Fine-Tuning with Integrated Replay & Efficiency

Updated 27 January 2026
  • FIRE is a comprehensive framework that integrates replay and adaptive data selection to overcome excessive data requirements and catastrophic forgetting during model fine-tuning.
  • It employs strategies such as adaptive difficulty selection, prioritized experience replay, and regularized loss formulations across LLMs, code generation systems, and MLIPs to enhance sample efficiency and stability.
  • Empirical results demonstrate significant improvements, including up to 65% fewer training steps and robust continual performance, validating FIRE’s effectiveness in diverse domains.

Fine-Tuning with Integrated Replay and Efficiency (FIRE) is a data-centric framework for efficient adaptation and continual improvement of machine learning models, primarily targeting LLMs, code generation systems, and machine-learned interatomic potentials (MLIPs). FIRE encompasses a suite of synergistic methodologies that combine replay, targeted data selection, regularization, and algorithmic innovations, enabling sample-efficient, stable, and scalable fine-tuning for both supervised and reinforcement learning scenarios.

1. Core Principles and Conceptual Framework

FIRE addresses two leading challenges in modern fine-tuning: excessive data and compute requirements, and catastrophic forgetting of previously acquired capabilities during adaptation. Its defining principle is the systematic integration of two ingredients: (i) replay, which reuses informative samples from past or auxiliary distributions, and (ii) algorithmic enhancements for selecting, prioritizing, or stabilizing new training examples. The concrete instantiation of this dual approach varies by domain, as detailed in the sections below.

These approaches employ replay both as a means to improve data utilization efficiency and as a tool to enforce a plasticity-stability trade-off essential for practical continual learning.
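As a concrete illustration of the dual approach, each training batch can be assembled from a mix of newly selected samples and replayed ones. The sketch below is a minimal, hypothetical illustration; the function name and the `replay_ratio` parameter are assumptions for exposition, not taken from any cited paper:

```python
import random

def assemble_batch(new_pool, replay_buffer, batch_size, replay_ratio=0.25):
    """Mix freshly selected samples with replayed ones in a fixed ratio."""
    n_replay = min(int(batch_size * replay_ratio), len(replay_buffer))
    n_new = batch_size - n_replay
    batch = random.sample(new_pool, n_new) + random.sample(replay_buffer, n_replay)
    random.shuffle(batch)  # avoid ordering bias between new and replayed samples
    return batch

random.seed(0)
batch = assemble_batch(list(range(100)), list(range(1000, 1040)), batch_size=8)
print(len(batch))  # 8: six new samples, two replayed
```

The `replay_ratio` knob is exactly the plasticity-stability dial discussed above: higher values favor retention, lower values favor adaptation.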

2. Replay Mechanisms and Data Selection

FIRE incorporates multiple replay mechanisms tailored to domain requirements:

  • Rollout Replay in LLM RLFT: In on-policy policy-gradient training (GRPO), recent rollouts are buffered and reused in subsequent updates, reducing per-step sample complexity by 11–13%. Off-policy corrections are made via importance weighting, maintaining theoretical consistency (akin to off-policy PPO) (Sun et al., 5 Jun 2025).
  • Possibility and Pass-rate Prioritized Experience Replay (PPER): In code generation, beam search is used to collect multiple candidate programs per prompt, each scored by generation likelihood and empirical pass rate. The P2Value metric, a convex combination of model and empirical signals, governs replay prioritization, which is implemented through rank-based sampling (Chen et al., 2024).
  • Replay Buffers in MLIP Fine-tuning: Small buffers comprising representative pretraining samples are interleaved with target-specific data during stochastic optimization. Partial or continual fine-tuning updates are performed using a composite loss that balances task adaptation against general knowledge retention (Liu et al., 25 Jan 2026).
  • Regularized Approximate Replay: In supervised LLM fine-tuning, batches of samples drawn from a pretraining-like corpus (e.g., openwebtext) are introduced during every update, combined with a KL-regularization penalty to the base model outputs, preventing catastrophic forgetting (Riemer et al., 26 Dec 2025).
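A rank-based prioritized buffer in the spirit of P2Value can be sketched as follows. The convex-combination weight `alpha`, the power-law rank weighting, and all function names are illustrative assumptions, not the exact formulation of Chen et al. (2024):

```python
import random

def p2value(likelihood, pass_rate, alpha=0.5):
    """Convex combination of model likelihood and empirical pass rate (P2Value-style)."""
    return alpha * likelihood + (1 - alpha) * pass_rate

def rank_based_sample(buffer, k, temperature=1.0):
    """Sample k items; priority follows a power law over descending P2Value rank."""
    ranked = sorted(buffer, key=lambda item: -item["p2"])
    weights = [1.0 / (rank + 1) ** temperature for rank in range(len(ranked))]
    return random.choices(ranked, weights=weights, k=k)

# Toy buffer of candidate programs with hypothetical scores.
random.seed(0)
buffer = [{"prog": i, "p2": p2value(random.random(), random.random())}
          for i in range(50)]
replayed = rank_based_sample(buffer, k=4)
```

Rank-based (rather than value-proportional) sampling makes replay robust to the scale of the priority scores, which is why it is a common choice in prioritized experience replay.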

Data selection often targets examples expected to maximize learning signal (e.g., moderate-difficulty problems in RLFT) or sample efficiency (e.g., clustering-based selection in MLIPs or pass-rate prioritization in code) (Sun et al., 5 Jun 2025, Liu et al., 25 Jan 2026, Chen et al., 2024).
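The moderate-difficulty criterion can be sketched as selecting the prompts whose estimated failure rate lies closest to the target; the function and variable names below are assumptions for illustration:

```python
def select_by_difficulty(prompts, difficulty, target=0.5, n=4):
    """Pick the n prompts whose estimated failure rate is closest to the target."""
    scored = sorted(prompts, key=lambda p: abs(difficulty[p] - target))
    return scored[:n]

# Hypothetical per-prompt failure rates under the current policy.
difficulty = {"a": 0.05, "b": 0.48, "c": 0.95, "d": 0.55, "e": 0.30}
print(select_by_difficulty(list(difficulty), difficulty, n=2))  # ['b', 'd']
```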

3. Algorithmic Structure and Loss Formulations

The FIRE framework generalizes to a range of optimization and learning setups by specifying structured procedures for integrating replay:

  • Reinforcement Learning Fine-tuning (LLMs):
    • Compute adaptive difficulty for each prompt as the mean failure rate under the current policy.
    • Use attention-based methods to cheaply predict adaptive difficulty for all data points.
    • Sample new rollouts near the target difficulty (typically 0.5), replay a fraction of buffered rollouts, and assemble mixed training batches.
    • Update the model using importance-weighted likelihood ratios and a clipped surrogate RL loss with a KL penalty (Sun et al., 5 Jun 2025).
  • Supervised and Preference-based Optimization:
    • Hybridize supervised losses (SFT) with preference/contrastive objectives (e.g., DPO), maintaining replay of both new and original preference data (Liu et al., 8 Aug 2025).
    • Periodically reset model weights toward the initial state to restore plasticity ("shrink-and-perturb"), combating primacy bias from overexploited replay (Liu et al., 8 Aug 2025).
    • Losses in MLIPs combine squared energy/force error on target samples and replayed pretraining samples, with tunable weights for loss components (Liu et al., 25 Jan 2026).
  • Continual Learning and LoRA-based Methods:

    • Use low-rank adaptation (LoRA) for per-task fine-tuning, with replayed data introduced in each batch to mitigate forgetting.
    • After each task, enter consolidation phases where replayed samples dominate, or employ sequential merging to stabilize accumulated knowledge (Hickok, 18 May 2025).
    • For LoRA-based LLM replay, the fine-tuning loss is:

    $$L(\phi) = \mathbb{E}_{(x, y) \sim D_{\text{new}}}\left[-\log \pi_{\theta+\phi}(y \mid x)\right] + \beta\, \mathbb{E}_{x \sim D_{\text{new}}}\left[D_{KL}\left(\pi_{\theta+\phi}(\cdot \mid x) \,\|\, \pi_{\theta_0}(\cdot \mid x)\right)\right] + \rho\, \mathbb{E}_{x \sim D_{\text{replay}}}\left[-\log \pi_{\theta+\phi}(x_t \mid x_{<t})\right]$$

    where $\theta_0$ is the base model, $\phi$ the LoRA adapters, and $\rho, \beta$ balance replay and regularization (Riemer et al., 26 Dec 2025).
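A toy numeric evaluation of the three loss terms makes the structure concrete. Token-level probability lists stand in for real model outputs, and every value below is a hypothetical placeholder:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def replay_regularized_loss(new_token_probs, adapted_dist, base_dist,
                            replay_token_probs, beta=0.1, rho=0.5):
    """Three-term loss: new-data NLL + beta * KL(adapted || base) + rho * replay NLL."""
    nll_new = -sum(math.log(p) for p in new_token_probs) / len(new_token_probs)
    kl_term = kl(adapted_dist, base_dist)
    nll_replay = -sum(math.log(p) for p in replay_token_probs) / len(replay_token_probs)
    return nll_new + beta * kl_term + rho * nll_replay

loss = replay_regularized_loss(
    new_token_probs=[0.6, 0.7],     # adapted model's probs on new-task tokens
    adapted_dist=[0.7, 0.2, 0.1],   # adapted next-token distribution on a prompt
    base_dist=[0.5, 0.3, 0.2],      # frozen base model's distribution on the same prompt
    replay_token_probs=[0.4, 0.5],  # adapted model's probs on replayed pretraining tokens
)
```

The KL term anchors the adapted policy to the frozen base model $\pi_{\theta_0}$ on new-task inputs, while the replay NLL term directly rehearses pretraining-like text; $\beta$ and $\rho$ trade these off against task adaptation.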

4. Theoretical Justification and Plasticity-Stability Trade-offs

FIRE directly addresses the plasticity-stability dilemma by leveraging both replay and structural regularization:

  • Gradient Maximization: In RL-based tuning, moderate-difficulty samples ($d = 0.5$) are demonstrated to maximize the expected policy-gradient norm, efficiently guiding optimization (Sun et al., 5 Jun 2025).
  • Plasticity vs. Stability via Replay and Regularization: High replay ratios greatly increase sample exploitation but risk primacy bias; periodic resets restore adaptability without sacrificing accumulated learning (Liu et al., 8 Aug 2025).
  • FIM-guided Regularization in MLIPs: The diagonal Fisher Information Matrix penalizes deviations from pretrained weights in critical parameter subspaces, balancing rapid domain adaptation against knowledge retention (Kim et al., 18 Jun 2025).
  • Synergy of Experience Replay and Regularization: Methods such as reEWC (Replay + EWC) in MLIP fine-tuning yield lower forgetting and better transfer than either approach in isolation, under fixed compute budgets (Kim et al., 18 Jun 2025).
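The FIM-guided penalty amounts to a weighted quadratic anchor on the pretrained weights. The sketch below shows the diagonal form; the Fisher values and $\lambda$ are hypothetical:

```python
def ewc_penalty(theta, theta_pretrained, fisher_diag, lam=1.0):
    """Diagonal-Fisher quadratic penalty: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2
        for f, t, t0 in zip(fisher_diag, theta, theta_pretrained)
    )

theta = [1.2, -0.5, 0.3]    # current fine-tuned parameters
theta0 = [1.0, -0.4, 0.0]   # pretrained anchor
fisher = [5.0, 0.1, 0.01]   # high Fisher value => parameter critical for pretraining
print(ewc_penalty(theta, theta0, fisher))
```

Parameters with large Fisher values are pinned near their pretrained values, while low-Fisher directions remain free to adapt; adding this penalty to a replay-augmented loss gives the reEWC combination described above.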

5. Empirical Results and Data Efficiency Gains

FIRE methods consistently produce quantifiable gains across diverse domains:

| Domain/Method | Efficiency Metric | Improvement/Observation | Reference |
| --- | --- | --- | --- |
| LLM RLFT (DOTS+RR) | Steps/time saved to target performance | 25–65% fewer steps; 38.8% avg. wall-clock savings | (Sun et al., 5 Jun 2025) |
| Code LLMs (BTP pipeline) | pass@1, pass@k, sample efficiency | 20–30% relative pass@1 gain; 2–3× fewer updates | (Chen et al., 2024) |
| Preference OPT (Reset Replay) | pass@1, final accuracy (math, MMLU-Pro) | +6–17% over SFT; +3–4% on general benchmarks | (Liu et al., 8 Aug 2025) |
| MLIP fine-tuning (FIRE, reEWC) | Energy/force RMSE, catastrophic forgetting | 1 meV/atom RMSE with 10× less data; robust transfer | (Liu et al., 25 Jan 2026; Kim et al., 18 Jun 2025) |
| Continual learning (LoRA + replay + merging) | Avg. accuracy, replay sample reduction | Up to 55–65% fewer replay samples at same or better accuracy | (Hickok, 18 May 2025) |
| LLM SFT with regularized replay (LoRA) | Forgetting/plasticity trade-off | Forgetting reduced from 15.4 to <0.1 BERTScore points at 1.2× compute | (Riemer et al., 26 Dec 2025) |

These results highlight that FIRE methods systematically reduce sample and compute costs without introducing performance trade-offs and, in several cases, exceed the accuracy or robustness of standard approaches.

6. Implementation Practices and Hyperparameter Guidelines

Empirical studies converge on several recurring practices for FIRE-based fine-tuning: keep replay ratios moderate so that sample reuse does not induce primacy bias; target new RLFT samples at moderate difficulty (failure rate near 0.5); interleave small buffers of representative pretraining data with target-specific data at every update; and pair replay with an explicit regularizer (a KL penalty to the base model, or FIM-weighted anchoring) rather than relying on either mechanism alone.

These settings are shown to yield stable, efficient, and generalizable fine-tuning across a range of model scales and target domains.
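These knobs could be collected into a single configuration object. Every default value below is an illustrative placeholder, not a recommendation taken from the cited papers:

```python
from dataclasses import dataclass

@dataclass
class FIREConfig:
    """Illustrative FIRE-style fine-tuning knobs (all defaults are placeholders)."""
    replay_ratio: float = 0.25      # fraction of each batch drawn from the replay buffer
    target_difficulty: float = 0.5  # preferred failure rate for new RLFT prompts
    kl_beta: float = 0.1            # weight on the KL penalty to the base model
    replay_rho: float = 0.5         # weight on the replay NLL term
    buffer_size: int = 4096         # replay buffer capacity
    reset_interval: int = 0         # steps between shrink-and-perturb resets (0 = off)

cfg = FIREConfig(replay_ratio=0.3)
```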

7. Limitations and Domain-Specific Trade-offs

While FIRE establishes broad improvements, several constraints and trade-offs remain:

  • External test or evaluation infrastructure is sometimes necessary for prioritization (e.g., code LLMs require unit tests for pass-rate) (Chen et al., 2024).
  • Storage and compute overhead can grow with replay buffer size and automated evaluation cycles, especially for large validation sets (Chen et al., 2024, Kim et al., 18 Jun 2025).
  • Aggressive prioritization or replay ratios may bias adaptation toward narrow regions of the problem space and under-exploit diverse transfer (Chen et al., 2024, Hickok, 18 May 2025).
  • Some strategies (e.g., regularized approximate replay) benefit most in settings where open-domain replay data is available; proprietary or data-restricted regimes may require careful surrogate construction (Riemer et al., 26 Dec 2025).

A plausible implication is that FIRE is most effective when replay data can be efficiently curated, evaluation signal is accessible, and prioritization/regularization schedules are carefully balanced.

