
Self Distillation (RLTF-SD) in Reinforcement Learning

Updated 3 February 2026
  • Self Distillation (RLTF-SD) is a method that uses a model’s own feedback-enhanced outputs as dense supervision to improve learning in reinforcement learning and sequence modeling.
  • The approach integrates policy gradients with knowledge distillation, combining cross-entropy objectives with KL divergence losses and leveraging temperature scaling for optimal calibration.
  • Empirical results show that RLTF-SD boosts performance metrics such as BLEU-1 and pass@1 in tasks like medical dialogue and code generation while ensuring stability through AWR-style objective refinements.

Self Distillation (RLTF-SD) refers to a family of algorithms within reinforcement learning (RL) and sequence modeling that leverage a model’s own responses, enhanced by intermediate feedback (often in the form of natural language critiques), to create superior “self-teachers.” These self-teachers provide dense supervision signals, allowing the base policy to internalize rich feedback with sample efficiency unattainable with pure scalar-reward RL or standard supervised fine-tuning. RLTF-SD is particularly effective in domains such as language modeling, code generation, and medical dialogue, with demonstrable gains in accuracy, calibration, and robustness across several high-value tasks (Song et al., 2 Feb 2026, Hübotter et al., 28 Jan 2026, Ao et al., 2021).

1. Conceptual Framework

Self Distillation (RLTF-SD) unifies reinforcement learning with distillation in an interactive setting where feedback is richer than binary rewards, but training remains fundamentally on-policy. In the generic RLTF protocol, a single-turn policy $\pi$ generates an initial output $y_0$ for a prompt $x_0$. After external text feedback $c_0$ (often a free-form human or automated critique), the policy is conditioned on the feedback-augmented prompt to generate a revised output $y_1$. RLTF-SD then uses the output distribution from this feedback-conditioned rollout as a “self-teacher,” distilling it back into the original policy for future single-turn use. This process realizes dense, token-level credit assignment, and the corrected generations $(x_0, y_1)$ serve as implicit demonstrations, even in the absence of explicit external instruction (Song et al., 2 Feb 2026, Hübotter et al., 28 Jan 2026).
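The two-turn protocol above can be sketched as follows. This is a minimal runnable illustration of the control flow only; `rltf_sd_step`, `DummyPolicy`, `get_feedback`, and `distill` are hypothetical stand-ins, not components named in the cited papers:

```python
def rltf_sd_step(policy, x0, get_feedback, distill):
    """One generic RLTF-SD step: rollout, critique, revise, self-distill."""
    y0 = policy.generate(x0)                  # initial single-turn output
    c0 = get_feedback(x0, y0)                 # free-form text critique
    x_aug = x0 + "\n[feedback] " + c0         # feedback-augmented prompt
    y1 = policy.generate(x_aug)               # revised "self-teacher" output
    # Distill the feedback-conditioned distribution back into the
    # single-turn policy; (x0, y1) acts as an implicit demonstration.
    distill(policy, teacher_prompt=x_aug, student_prompt=x0, target=y1)
    return y1


class DummyPolicy:
    """Toy policy used only to exercise the control flow."""
    def generate(self, prompt):
        return "draft for: " + prompt
```

In a real system, `distill` would take a gradient step on the student's single-turn distribution toward the feedback-conditioned teacher distribution.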

RLTF-SD extends this concept across a variety of architectures, from frozen teacher–student pairs in supervised fine-tuning to feedback-conditioned sequential policies; the algorithmic structure common to these instantiations is described next.

2. Algorithmic Structure and Loss Functions

A defining feature of RLTF-SD is the combination of policy optimization objectives with self-distillation losses under feedback-conditioned rollouts. Architecturally, the RLTF-SD procedure advances as follows:

  • Teacher policy: After fine-tuning to convergence on the core task, the model acts as a “teacher,” frozen or partially updated.
  • Student policy: An identically parameterized model is initialized and trained to combine standard cross-entropy (CE) objectives with a distillation term measuring divergence between student and teacher softmax outputs.
  • Loss composition: The overarching loss is a convex combination:

$$L = (1 - \alpha_{\text{kd}})\, \mathrm{CE}_{\text{LS}} + \alpha_{\text{kd}}\, L_{\text{KD}}$$

where $\mathrm{CE}_{\text{LS}}$ is cross-entropy with label smoothing, and $L_{\text{KD}} = \tau^2 \sum_{i=1}^{C} q_i^{(t)} \log \left[ q_i^{(t)} / q_i^{(s)} \right]$ is the KL-divergence knowledge distillation loss at temperature $\tau$. Empirical settings often use $\alpha_{\text{kd}} = 0.5$, label smoothing $\alpha_{\text{ls}} = 0.1$, and $\tau$ (or $T$) optimized by grid search to minimize calibration error (Ao et al., 2021).

  • RL variant: In feedback-augmented RL, the objective involves importance-weighted advantage estimation and policy-gradient-style updates, but the empirically optimal choice is to discard high-variance importance weights—“AWR style”—and use a first-turn reward baseline:

$$A_i = R(x_0, y_1^i) - b^{(0)}, \quad b^{(0)} = \frac{1}{N} \sum_{i=1}^{N} R(x_0, y_0^i)$$

leading to a stable, unbiased gradient estimate for maximizing single-turn performance $J_1(\pi)$ (Song et al., 2 Feb 2026).

  • Self-Teacher KL minimization: In sequential environments, the policy minimizes

$$L_{\text{SDPO}}(\theta) = \mathbb{E}_{\tau, y, f}\left[\sum_{t=0}^{T-1} D_{\text{KL}}\left(q_\theta(\cdot \mid s_t, f)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right]$$

with a “stop-gradient” applied to $q_\theta$, effectively using the model’s own predictions conditioned on feedback as dense learning targets (Hübotter et al., 28 Jan 2026).
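The objectives above can be made concrete with a minimal NumPy sketch (an illustration of the formulas under the stated hyperparameters, not the papers' implementations): the convex CE/KD combination and the AWR-style advantages with a first-turn baseline.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Numerically stable softmax at temperature tau."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rltf_sd_loss(student_logits, teacher_logits, labels,
                 alpha_kd=0.5, alpha_ls=0.1, tau=2.0):
    """L = (1 - a_kd) * CE_LS + a_kd * L_KD, with temperature-scaled KL."""
    n, C = student_logits.shape
    # Label-smoothed targets: 1 - a_ls on the true class, rest spread uniformly.
    targets = np.full((n, C), alpha_ls / (C - 1))
    targets[np.arange(n), labels] = 1.0 - alpha_ls
    ce_ls = -(targets * np.log(softmax(student_logits))).sum(axis=1).mean()
    # KD term: tau^2 * KL(teacher || student) at temperature tau.
    q_t = softmax(teacher_logits, tau)
    q_s = softmax(student_logits, tau)
    kd = tau**2 * (q_t * (np.log(q_t) - np.log(q_s))).sum(axis=1).mean()
    return (1.0 - alpha_kd) * ce_ls + alpha_kd * kd

def first_turn_advantages(first_turn_rewards, second_turn_rewards):
    """AWR-style advantages: second-turn rewards centered by the mean
    first-turn reward b^(0), with importance weights dropped."""
    b0 = np.mean(first_turn_rewards)
    return np.asarray(second_turn_rewards) - b0
```

When student and teacher logits coincide, the KD term vanishes, so early in distillation the loss is dominated by the label-smoothed cross-entropy.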

3. Calibration, Temperature Scaling, and Empirical Protocols

Calibration is a central concern in RLTF-SD, particularly in safety-critical domains such as medical dialogue. RLTF-SD incorporates calibration through two techniques (Ao et al., 2021):

  • Label Smoothing (LS): regularizes the target distribution away from one-hot posteriors, lowering Expected Calibration Error (ECE).
  • Temperature Scaling (TS): after model convergence, the output logits are divided by a temperature $T$ learned on a validation set, refining confidence estimates:

$$\hat{p}^{\text{TS}}_c = \mathrm{softmax}\left(z_c / T\right)$$

  • Optimal temperature selection is achieved via grid search for minimal ECE and maximal BLEU-1, with a fixed $T^{\text{fix}} = 2$ as a baseline and an optimal $T^{\text{opt}}$ (e.g., 4.789) for best calibration.

Calibration metrics include:

  • Expected Calibration Error (ECE), which partitions outputs into $M = 15$ confidence bins and aggregates the absolute confidence–accuracy gap.
  • Maximum Calibration Error (MCE), the worst-case bin gap.
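The calibration pipeline above can be sketched in a few lines of NumPy (an illustration under the definitions given in the text, not the paper's code): binning-based ECE with $M = 15$ bins, and a grid search over the post-hoc temperature $T$.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with logits divided by temperature T (temperature scaling)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error: bin-weighted |confidence - accuracy|
    over equal-width confidence bins (M = 15 as in the text)."""
    conf = np.asarray(confidences)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return err

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 8.0, 76)):
    """Grid-search the temperature T minimizing ECE on a validation set."""
    best_T, best_ece = 1.0, float("inf")
    for T in grid:
        p = softmax_T(val_logits, T)
        e = ece(p.max(axis=1), p.argmax(axis=1) == val_labels)
        if e < best_ece:
            best_T, best_ece = T, e
    return best_T, best_ece
```

MCE would replace the weighted sum in `ece` with a maximum over the per-bin gaps.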

4. Empirical Results and Performance Analysis

Empirical evaluations demonstrate improvements in both generation quality and calibration across diverse domains:

Medical dialogue (Backpain, MedDialog) (Ao et al., 2021):

  • Metrics measured: BLEU-1, perplexity (PPL), METEOR, and ECE.
  • RLTF-SD with the optimal temperature achieves the best trade-off: e.g., BLEU-1 = 0.4473, ECE = 0.1788, outperforming both CULMFiT (label smoothing only) and standard ULMFiT.

Reasoning and code generation (RLTF Benchmarks) (Song et al., 2 Feb 2026, Hübotter et al., 28 Jan 2026):

  • RLTF-SD provides absolute gains on reasoning (up to 15 pts) and creative writing benchmarks (8–12 pts).
  • On LiveCodeBench v6, RLTF-SD yields final pass@1 = 48.8% (vs 41.2% for baseline GRPO), with SDPO achieving equivalent accuracy in ¼ the generations and superior performance on hard discovery tasks.

The benefits are more pronounced at larger model scales (≥2B parameters), with ablations showing that dense self-distillation yields higher gains than rejection sampling or scalar-advantage RL. Empirical stability is maximized by using first-turn baselines and dropping importance-sampling (IS) corrections.

5. Theoretical Properties, Variance Reduction, and Limitations

Theoretical analysis of RLTF-SD establishes:

  • Unbiasedness: First-turn baselines for advantage estimation yield unbiased estimates of $\nabla J_1(\pi)$, avoiding the gradient-signal collapse induced by centering at second-turn rewards (Song et al., 2 Feb 2026).
  • Bias–Variance Trade-off: Dropping importance-sampling (IS) weights introduces mild bias but drastically reduces variance, which is essential in long-sequence settings (Hübotter et al., 28 Jan 2026).
  • No general global convergence guarantee is provided, but empirical stability and the unbiased estimator for $J_1$ are established.

Limitations include the dependency on the quality of feedback—noisy or adversarial feedback may degrade learning. The two-turn protocol is most tractable; extending to truly multi-turn interactions (H > 2) imposes challenges in context management.

6. Practical Recommendations, Ablations, and Implementation

Key recommendations and practical findings include (Ao et al., 2021, Hübotter et al., 28 Jan 2026):

  • Use strong in-context learners (≥2B parameters) for effective feedback extraction.
  • Tune temperature scaling on a validation set to reach optimal calibration.
  • Employ batch sizes of 8–32 questions × 4–16 rollouts, top-K logit distillation ($K = 100$ at train time, $K = 20$ at test time), and the AdamW optimizer with a learning rate of 1e-6 to 1e-5.
  • Jensen–Shannon or reverse-KL divergences, with per-token advantages clipped to [–5,5], are preferred for stability.
  • In scalar-only environments, group-based solution relabeling can simulate explicit feedback: failed rollouts are paired with the most successful in-batch sample as feedback.
  • Top-K distillation avoids excessive memory consumption.
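Top-K logit distillation can be sketched as follows (an illustrative NumPy version in which the teacher's top-K logits are renormalized before the KL; a sketch of the technique, not the papers' implementation):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def topk_kd_loss(student_logits, teacher_logits, k=100, tau=1.0):
    """KL(teacher || student) restricted to the teacher's top-K logits per
    position, renormalized over that subset. Storing K logits instead of
    the full vocabulary is what keeps memory bounded."""
    k = min(k, teacher_logits.shape[-1])
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]        # top-K token ids
    t = np.take_along_axis(teacher_logits, idx, axis=-1) / tau
    s = np.take_along_axis(student_logits, idx, axis=-1) / tau
    log_qt, log_qs = log_softmax(t), log_softmax(s)
    kl = (np.exp(log_qt) * (log_qt - log_qs)).sum(axis=-1)
    return tau**2 * kl.mean()
```

With $K = 100$ over a 100k-token vocabulary, the stored teacher distribution shrinks by roughly three orders of magnitude per position.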

Ablation studies confirm:

  • First-turn reward baselines and AWR-style objectives are superior in performance and stability.
  • Logit-level distillation (top-K tokens per position) outperforms sequence-level or token-level only variants.
  • Hybrid GRPO+SDPO offers stabilization for weaker models.

7. Extensions and Future Directions

Promising directions include:

  • Combining RLTF-SD with auxiliary feedback modeling (RLTF-FM) for synergistic gains (Song et al., 2 Feb 2026).
  • Expanding to truly multi-turn distillation via recursive or hierarchical baselines, handling longer feedback-chains.
  • Automated calibration or curation of human feedback to mitigate noise or bias in supervision.
  • Theoretical analyses of distribution shift and stability as the policy evolves under online distillation.

The methodology highlights that dense, feedback-driven self-distillation, as instantiated in RLTF-SD, enables modern language and sequence models to push beyond the sample-efficiency and credit-assignment bottlenecks of reward-only RL and simple demonstrations, providing a scalable and robust approach for domains where feedback is richer and more structured (Hübotter et al., 28 Jan 2026, Song et al., 2 Feb 2026, Ao et al., 2021).
