Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

Published 27 Apr 2026 in cs.CL | (2604.25011v1)

Abstract: Reinforcement learning (RL)-based post-training often improves the reasoning performance of LLMs beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at https://github.com/danshi777/RL-generalization.

Authors (6)

Summary

The paper shows that RL post-training largely preserves the base model's representations and relies on a compact, task-agnostic set of features, whereas SFT rapidly introduces many highly specialized features.
It uses Sparse Crosscoders to align features across models and applies causal interventions: disabling the identified features degrades RL models' generalization, while amplifying them improves base models' performance on several benchmarks.
The findings suggest that RL-based post-training generalizes by preserving base features that support cross-domain transfer.

Mechanistic Feature-Level Analysis of Reinforcement Learning Generalization in LLMs

Motivation and Controlled Setup

The paper "Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in LLMs" (2604.25011) addresses an open question: why do LLMs post-trained with RL often generalize better to reasoning tasks outside their training domain, whereas supervised fine-tuning (SFT) frequently causes forgetting of general capabilities? The authors use a controlled experimental protocol, training SFT and RL models from the same base checkpoint on identical data, so that observed representational differences can be attributed to the post-training paradigm rather than to data or initialization. The analysis covers several model families (Qwen3-4B, Qwen2.5-7B, Llama3.1-8B-Instruct) and is evaluated on both mathematical and general QA benchmarks.

Feature Alignment via Sparse Crosscoders

The paper leverages Sparse Crosscoders (SCs) as the central interpretability mechanism. These linear, sparse autoencoder variants are used to jointly encode the residual activations of models in a unified, semantically aligned feature space. The authors advance SC methodology by introducing a three-model variant, allowing simultaneous alignment and attribution across base, SFT, and RL models. The Model Attribution Score (MAS) enables precise quantification of feature exclusivity and sharing across paradigms. The Normalized Relative Norm (NRN) further captures model-specificity of features.
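The shared-feature-space idea can be sketched with a toy crosscoder forward pass. This is only a minimal illustration under stated assumptions: the ReLU encoder, the tiny dimensions, the hand-picked weights, and the names `encode` and `decode` are all hypothetical, and the paper's actual architecture and training loss are not reproduced here. The point is that one latent vector is computed from all models' activations, while each model keeps its own decoder.

```python
# Toy sparse crosscoder forward pass (illustrative assumptions throughout).

def relu(x):
    return x if x > 0.0 else 0.0

def encode(acts_by_model, enc_by_model, bias):
    """Sum per-model encoder projections into one shared latent vector."""
    n_features = len(bias)
    pre = list(bias)
    for model, acts in acts_by_model.items():
        W = enc_by_model[model]  # shape: n_features x d_model
        for f in range(n_features):
            pre[f] += sum(w * a for w, a in zip(W[f], acts))
    return [relu(p) for p in pre]

def decode(latent, dec):
    """Reconstruct one model's activation from the shared latent.
    dec has shape n_features x d_model (one decoder row per feature)."""
    d_model = len(dec[0])
    return [sum(latent[f] * dec[f][j] for f in range(len(latent)))
            for j in range(d_model)]

# Hypothetical residual activations for two models at the same token.
acts = {"base": [1.0, 0.0], "rl": [1.0, 0.5]}
enc = {
    "base": [[0.5, 0.0], [0.0, -1.0]],
    "rl":   [[0.5, 0.0], [0.0, 1.0]],
}
dec = {
    "base": [[1.0, 0.0], [0.0, 0.0]],  # feature 1 has zero norm for base
    "rl":   [[1.0, 0.0], [0.0, 1.0]],  # feature 1 is used only by RL
}
latent = encode(acts, enc, bias=[0.0, 0.0])
recon = {m: decode(latent, dec[m]) for m in dec}
```

In this toy setup feature 0 is shared by both models, while feature 1 has a nonzero decoder row only for the RL model, which is the kind of asymmetry that decoder-norm scores are meant to measure.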

Representational Dynamics: SFT vs RL

A principal finding is the stark divergence in representational evolution under SFT and RL. SFT rapidly introduces a large set of highly specialized features (NRN $\geq$ 0.8), stabilizing early and remaining persistent throughout optimization. This results in a rigid, task-specific internal structure which discards many base features. In contrast, RL post-training induces fewer, milder deviations (no extreme NRN tail), preserving much of the original model's feature repertoire and continually evolving its internal landscape.
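To make the NRN threshold concrete, here is a minimal sketch that counts features above 0.8. It assumes the NRN of a feature is the tuned model's decoder norm divided by the sum of the tuned and base decoder norms, a common convention for two-model crosscoders; the paper's exact definition may differ. The function names and example decoder rows are hypothetical.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def relative_norm(dec_base, dec_tuned):
    """Assumed NRN: share of decoder norm held by the tuned model, in [0, 1]."""
    nb, nt = l2_norm(dec_base), l2_norm(dec_tuned)
    total = nb + nt
    return 0.5 if total == 0.0 else nt / total

def count_specialized(dec_base_rows, dec_tuned_rows, threshold=0.8):
    """Count features whose assumed NRN meets the threshold."""
    scores = [relative_norm(b, t)
              for b, t in zip(dec_base_rows, dec_tuned_rows)]
    return sum(s >= threshold for s in scores), scores

# One decoder row per feature; values are made up for illustration.
base_rows = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
tuned_rows = [[3.0, 4.0], [0.0, 9.0], [0.0, 2.0]]
n_specialized, scores = count_specialized(base_rows, tuned_rows)
```

Under this assumed definition a score near 0.5 marks a feature shared equally by both models, and a score near 1 marks a feature that exists almost only in the tuned model.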

Figure 1: Distribution of Normalized Relative Norms across different training methods and model scales demonstrates SFT's aggressive feature turnover versus RL's retention of base features.

Moreover, temporal analysis with SCs across training checkpoints shows that under SFT, feature overlap between epochs is high (features are established early), whereas under RL, overlap between checkpoints is low and features continue to turn over throughout training.
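A checkpoint-overlap comparison of this kind can be sketched as a pairwise overlap matrix. The sketch assumes Jaccard similarity between the sets of model-specific feature indices at each checkpoint; the overlap metric used in the paper is not stated here, and the feature sets below are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of feature indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def overlap_matrix(feature_sets):
    """Pairwise Jaccard overlap between per-checkpoint feature sets."""
    return [[jaccard(x, y) for y in feature_sets] for x in feature_sets]

# Hypothetical feature indices found at three successive checkpoints.
sft_ckpts = [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}]
rl_ckpts = [{1, 2}, {2, 6}, {6, 7}]
sft_overlap = overlap_matrix(sft_ckpts)
rl_overlap = overlap_matrix(rl_ckpts)
```

A matrix with uniformly high off-diagonal values corresponds to early stabilization, while values that decay with checkpoint distance correspond to continued turnover.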

Figure 2: Feature overlap heatmaps across training checkpoints show SFT's early stabilization and RL's incremental, protracted feature evolution.

Feature rank-shift analyses further support this distinction: SFT's rank hierarchy largely stabilizes after the initial epochs, whereas RL shows continuous rank reordering and blanking.

MAS Distribution and Feature Specificity

The three-model Sparse Crosscoder allows direct comparison of feature specificity. MAS distributions confirm SFT's right-skew (many strongly attributed features), while RL yields significantly fewer paradigm-specific features, supporting the thesis that RL tunes generalization through subtle shared features.
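For the three-model case, one plausible way to score attribution is sketched below. It assumes the MAS of a feature for a given model is that model's share of the feature's total decoder norm across base, SFT, and RL; this definition, the `min_share` cutoff, and the function names are hypothetical stand-ins, since the paper's formula is not given here.

```python
import math

def attribution_scores(dec_rows_by_model):
    """Assumed MAS: each model's share of a feature's total decoder norm."""
    norms = {m: math.sqrt(sum(v * v for v in row))
             for m, row in dec_rows_by_model.items()}
    total = sum(norms.values())
    if total == 0.0:
        return {m: 0.0 for m in norms}
    return {m: n / total for m, n in norms.items()}

def dominant_model(scores, min_share=0.6):
    """Label a feature by one model if its share is large, else 'shared'."""
    model, share = max(scores.items(), key=lambda kv: kv[1])
    return model if share >= min_share else "shared"

# Decoder rows of one feature in each model (made-up values).
feature = {"base": [0.0, 1.0], "sft": [0.0, 8.0], "rl": [1.0, 0.0]}
scores = attribution_scores(feature)
label = dominant_model(scores)

balanced = {"base": [1.0, 0.0], "sft": [1.0, 0.0], "rl": [1.0, 0.0]}
balanced_label = dominant_model(attribution_scores(balanced))
```

With these toy values the first feature is attributed to SFT and the second is labeled as shared.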

Figure 3: Distribution of Model Attribution Scores on Qwen3-4B-Base; SFT induces many high-attribution features, RL operates with restrained specificity.

Figure 4: Distribution of Model Attribution Scores across different training methods on Qwen2.5-7B; consistent trends are observed for larger models.

Mechanistic Identification of Generalization Features

To move from descriptive statistics to causal analysis, the authors localize "generalization-controlling" features. They focus on samples where the RL model succeeds and the base model fails, across diverse benchmarks, and select features whose activation differences between the two models are consistent and significant. Intersecting these per-task selections yields a compact, task-agnostic set of generalization features (50 for Qwen3-4B, 16 for Qwen2.5-7B).
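The selection-then-intersection procedure can be sketched as follows. This is a simplified stand-in, not the authors' criterion: it assumes a feature is selected for a task when its mean activation gap (RL minus base) over the RL-succeeds, base-fails samples exceeds a hypothetical `min_gap`, whereas the paper's test for "consistent and significant" differences is not specified here. All data are invented.

```python
from statistics import mean

def task_features(base_acts, rl_acts, min_gap=0.5):
    """Features whose mean activation gap (RL minus base) exceeds min_gap.
    Each argument is a list of per-sample feature-activation lists."""
    n_features = len(base_acts[0])
    selected = set()
    for f in range(n_features):
        gap = mean(r[f] - b[f] for b, r in zip(base_acts, rl_acts))
        if gap > min_gap:
            selected.add(f)
    return selected

def generalization_features(tasks):
    """Intersect per-task selections to keep only task-agnostic features."""
    per_task = [task_features(b, r) for b, r in tasks.values()]
    return set.intersection(*per_task)

# task name -> (base activations, RL activations) on the filtered samples
tasks = {
    "math": ([[0.0, 0.1, 0.0], [0.2, 0.0, 0.0]],
             [[1.0, 0.1, 0.9], [1.2, 0.1, 1.1]]),
    "qa":   ([[0.1, 0.0, 0.5], [0.0, 0.0, 0.4]],
             [[0.9, 1.0, 0.6], [1.1, 0.9, 0.5]]),
}
shared = generalization_features(tasks)
```

Here features 0 and 2 are selected for the math task and features 0 and 1 for the QA task, so only feature 0 survives the intersection.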

Figure 5: Feature overlap across tasks reveals high intersection, identifying compact sets of generalization-controlling features.

Causal Intervention Experiments

The causal role of these features is tested by intervention: zeroing them in RL models leads to large performance drops (>40% on OpenBookQA and CommonsenseQA), while amplifying them in base models yields substantial gains (up to 56% on SciQ for Qwen2.5-7B). The effect transfers: amplifying the generalization features also improves performance on unseen tasks (LogiQA, PIQA), indicating that they are not tied to a particular task domain.
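A feature-level intervention of this kind can be sketched as an edit to the activation along the decoder directions of the chosen features. The sketch assumes the intervention rescales each selected latent and adds the resulting change, weighted by that feature's decoder row, back onto the original activation; the paper's exact patching procedure and scale factors are not given here, and `intervene` is a hypothetical helper.

```python
def intervene(activation, latent, dec, features, scale):
    """Rescale chosen latent features and apply the change to the activation.
    scale=0.0 ablates the features; scale>1.0 amplifies them.
    dec has one decoder row (length d_model) per feature."""
    patched = list(activation)
    for f in features:
        delta = (scale - 1.0) * latent[f]  # change in the feature's latent
        for j, w in enumerate(dec[f]):
            patched[j] += delta * w
    return patched

# Toy values: two features with orthogonal decoder directions.
activation = [1.0, 0.5]
latent = [1.0, 0.5]
dec = [[1.0, 0.0], [0.0, 1.0]]
ablated = intervene(activation, latent, dec, features=[1], scale=0.0)
amplified = intervene(activation, latent, dec, features=[1], scale=3.0)
```

Applying the change as a delta, rather than replacing the activation with the crosscoder's reconstruction, leaves untouched whatever part of the activation the crosscoder fails to reconstruct.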

Theoretical Implications

The paper offers a mechanistic explanation for RL-induced generalization: RL does not simply incentivize correct answers, but selectively strengthens a compact, task-agnostic subspace of internal features that causally mediates cross-domain transfer. SFT, conversely, overfits through feature proliferation that encodes teacher traces and displaces general reasoning pathways. This distinction reframes RL post-training as a way of preserving and selectively reweighting existing functional circuitry, in contrast to the conventional distillation-based view of SFT.

Practical Implications and Future Directions

These findings have practical implications: RL is preferable for reasoning post-training when transferability and avoidance of catastrophic forgetting are critical. The SC methodology provides a foundation for interpretability-focused fine-tuning, enabling targeted interventions in internal feature spaces. Future research directions include training objectives that explicitly select for generalization features, integration with alternative interpretability schemes, and scaling experiments across model architectures and task distributions.

Conclusion

Through controlled experiments and mechanistic feature-level analyses, the paper argues that RL post-training generalizes by reinforcing a stable, task-agnostic feature subspace, in sharp contrast to SFT's over-specialized and rigid internal encoding. Feature-level causal interventions confirm that generalization is governed by a small set of functional features that transfer across tasks and architectures. The analysis underscores the importance of granular interpretability methods for understanding and controlling post-training dynamics in LLMs.
