
MoRFI: Monotonic Sparse Autoencoder Feature Identification

Published 29 Apr 2026 in cs.CL and cs.LG | (2604.26866v1)

Abstract: LLMs acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model's ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.

Summary

  • The paper introduces MoRFI, a robust pipeline using sparse autoencoders to detect monotonic shifts in latent activations during fine-tuning.
  • It demonstrates that single-latent steering can recover up to 85% of forgotten pre-trained facts, highlighting the causal role of specific activations.
  • The study shows that post-training hallucinations stem from disrupted knowledge access rather than loss, paving the way for targeted inference-time interventions.

Introduction

Monotonic Relationship Feature Identification (MoRFI), as presented in "MoRFI: Monotonic Sparse Autoencoder Feature Identification" (2604.26866), addresses fundamental questions regarding how LLMs integrate new knowledge during post-training and how such integration gives rise to hallucinations. The paper focuses on discovering and mechanistically attributing latent directions in the residual stream of transformers that causally modulate the retention and retrieval of pre-trained knowledge. MoRFI departs from existing approaches by providing a robust, statistically principled pipeline for feature selection via monotonic trend validation over controlled fine-tuning conditions, operationalized using pre-trained sparse autoencoders (SAEs).

Context and Motivation

During pre-training, LLMs aggregate a vast amount of factual and relational knowledge via next-token prediction. However, instruction tuning, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) introduce additional knowledge that may not be grounded in the model’s parametric space, frequently resulting in hallucinations. Prior studies have shown that fine-tuning with unfamiliar data increases hallucination rates (Gekhman et al., 2024; Kang et al., 2024). Although it is well established that LLMs typically struggle to assimilate factual content outside of pre-training, the precise internal mechanisms driving hallucinations and knowledge accessibility are still incompletely understood.

Methodology

MoRFI’s approach is predicated on the observation that fine-tuning with increasing proportions of unknown facts disrupts retrieval behavior in LLMs. The central technique is to use SAEs to obtain latent representations of the residual stream activations across model checkpoints fine-tuned under varying controlled conditions (mixtures of known and unknown facts or increasing epochs). The methodology proceeds as follows:

  1. Controlled Fine-Tuning: LLMs (Llama 3.1 8B, Gemma 2 9B, Mistral 7B v03) are fine-tuned on closed-book QA datasets derived from EntityQuestions, with synthetic mixtures designed to systematically vary the proportion of unknown facts and the number of epochs.
  2. Activation Tensor Construction: A 4D tensor $\mathcal{A}$ is constructed, capturing activations over samples, property configurations (e.g., % unknown facts), latent feature dimensions, and timesteps.
  3. Monotonic Trend Detection: The MoRFI algorithm leverages bootstrapped statistical tests (Spearman and Mann-Kendall) to robustly filter latents whose activations exhibit consistent monotonic increase or decrease along the controlled variable.
  4. Impact Assessment: Identified latents are causally attributed by steering the residual activations along these directions and measuring test accuracy recovery, especially for facts verifiably known by the base model but forgotten after fine-tuning.

The selection pipeline ensures statistical rigor by requiring trends to be both significant and consistent across resampled replicate conditions, thus minimizing false discovery of features merely correlated with prompt batches.
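
The monotonic trend filter in step 3 can be sketched for a single latent as follows. This is a simplified illustration, not the paper's implementation: it thresholds the Spearman coefficient and the normalized Mann-Kendall S statistic instead of computing p-values, and the function names, the thresholds (`min_rho`, `min_tau`, `min_agree`), and the choice to resample prompts within each condition are all assumptions made for the example. Levels are assumed to be listed in ascending order.

```python
import random

def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def mann_kendall_s(series):
    """Mann-Kendall S: later-greater pairs minus later-smaller pairs."""
    s = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            s += (series[j] > series[i]) - (series[j] < series[i])
    return s

def monotonic_direction(acts_by_level, levels, n_boot=200, min_rho=0.9,
                        min_tau=0.8, min_agree=0.95, seed=0):
    """Return +1 (increasing), -1 (decreasing) or 0 (no consistent trend).

    acts_by_level[c] holds one latent's per-prompt activations at levels[c].
    """
    if len(levels) < 3:
        raise ValueError("need at least three condition levels")
    rng = random.Random(seed)
    max_s = len(levels) * (len(levels) - 1) // 2
    votes = {1: 0, -1: 0}
    for _ in range(n_boot):
        # Resample prompts with replacement inside every condition.
        means = []
        for samples in acts_by_level:
            draw = [rng.choice(samples) for _ in samples]
            means.append(sum(draw) / len(draw))
        rho = spearman_rho(levels, means)
        tau = mann_kendall_s(means) / max_s
        for sign in (1, -1):
            if sign * rho >= min_rho and sign * tau >= min_tau:
                votes[sign] += 1
    # Keep the latent only if nearly all replicates agree on one direction.
    for sign in (1, -1):
        if votes[sign] / n_boot >= min_agree:
            return sign
    return 0
```

Requiring agreement across bootstrap replicates is what guards against latents that track a particular batch of prompts: a trend that appears only for some resamples does not reach the `min_agree` fraction and is dropped.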

Key Results

Knowledge Retention and Hallucinations

Extended fine-tuning on mixtures with higher proportions of unknown facts leads to pronounced, monotonic degradation in retrieval of pre-trained knowledge, observed consistently across all model architectures.

Figure 1: Test accuracy consistently declines on Llama 3.1 8B as the proportion of unknown facts in fine-tuning increases and as more epochs are used.

Identification and Causal Attribution of Latents

MoRFI identifies latents whose activations change monotonically with exposure to unfamiliar knowledge. Critically:

  • Single-latent steering can recover 69–85% of the facts forgotten in the hallucinating model, with most recovered facts being among those originally accessible to the base model.
  • The effect is sparse: steering the composite difference-in-means direction yields improvements, but optimal performance is obtained by intervening on a small subset of highly causal latents.
  • There is consistent asymmetry: Suppressing latents that increase with exposure to unknown facts (“negative steering”) is systematically more beneficial than amplifying others. Decreasing latents (those whose activation is reduced by fine-tuning on unknowns) are disproportionately impactful for knowledge recovery.

    Figure 2: Dev accuracy gains on Llama 3.1 8B from single-latent steering along the unknown dimension; top-performing latents yield substantial accuracy restoration superior to random-control latents.
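
A minimal sketch of a single-latent intervention and of the recovery metric quoted above. It assumes that steering means adding a scaled, unit-normalized SAE decoder direction to the residual activation, and that recovery is the share of facts answered correctly by the base model, missed by the fine-tuned checkpoint, and answered correctly again after steering. Both definitions, and the names `steer` and `recovery_rate`, are illustrative assumptions, not details confirmed by the summary.

```python
def steer(residual, direction, alpha):
    """Add alpha times the unit-normalized latent direction to a residual vector.

    A negative alpha suppresses the latent ("negative steering").
    """
    norm = sum(v * v for v in direction) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [r + alpha * d / norm for r, d in zip(residual, direction)]

def recovery_rate(base_correct, tuned_correct, steered_correct):
    """Fraction of forgotten facts that the steered model answers correctly.

    Each argument is a set of question ids answered correctly by that model.
    """
    forgotten = base_correct - tuned_correct
    if not forgotten:
        return 0.0
    return len(forgotten & steered_correct) / len(forgotten)
```

For example, with base-correct ids `{1, 2, 3, 4, 5}`, fine-tuned-correct ids `{1, 2}`, and steered-correct ids `{1, 3, 4}`, three facts were forgotten and two of them come back, so the recovery rate is 2/3.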

Latent Structure and Generalization

Steering along the composite direction, defined as the mean shift in SAE activations from pre-training to full unknown exposure, produces generalizable accuracy gains.

Figure 3: Steering with the composite direction, $\boldsymbol{\delta_u}$, restores accuracy in hallucinating checkpoints for all models. The effect is consistent, but single-latent steering outperforms composite interventions, underscoring sparsity.
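
The composite direction can be illustrated as a difference in means over SAE activations. The sketch below assumes that each row holds the SAE latent activations for one prompt, and that the shift is mapped back into the residual stream through the SAE decoder rows; the decoder projection and the helper names are assumptions for illustration only.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def composite_direction(base_acts, tuned_acts):
    """Mean SAE activation shift from the base model to the fine-tuned one."""
    base_mean = mean_vector(base_acts)
    tuned_mean = mean_vector(tuned_acts)
    return [t - b for b, t in zip(base_mean, tuned_mean)]

def to_residual(delta, decoder):
    """Project a latent-space shift into residual space.

    decoder[j] is assumed to be the residual-space direction of latent j.
    """
    d_model = len(decoder[0])
    return [sum(delta[j] * decoder[j][k] for j in range(len(delta)))
            for k in range(d_model)]
```

Because the composite sums over every latent that shifted, it mixes causally relevant latents with incidental ones, which is one plausible reading of why single-latent steering outperforms it.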

Cosine similarity analysis of the top latents confirms that impactful features are generally non-redundant and distributed, rather than clustering tightly in latent space.

Similarity of Feature Subspaces

Figure 4: Cosine similarity of top-10 latents (unknown/epochs) surfaced by MoRFI on Llama 3.1 8B. The low within-group similarity indicates that these features span a broad subspace of the residual stream’s representation.

This distributed structure suggests that critical knowledge-access activations are not trivially localizable but are spread across the feature space.
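
The redundancy check behind Figure 4 amounts to pairwise cosine similarity between the directions of the selected latents. A minimal sketch, assuming each latent is represented by one residual-space vector (for instance an SAE decoder row, which is an assumption here) and summarizing the matrix by its mean absolute off-diagonal entry, a summary statistic chosen for this example:

```python
def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)

def similarity_matrix(directions):
    """Full pairwise cosine similarity matrix."""
    return [[cosine(u, v) for v in directions] for u in directions]

def mean_offdiagonal(directions):
    """Mean absolute cosine similarity over distinct pairs of directions."""
    pairs = [(i, j) for i in range(len(directions))
             for j in range(i + 1, len(directions))]
    if not pairs:
        raise ValueError("need at least two directions")
    return sum(abs(cosine(directions[i], directions[j]))
               for i, j in pairs) / len(pairs)
```

A value near 0 indicates nearly orthogonal, non-redundant directions, while a value near 1 would indicate that the selected latents are close to copies of each other.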

Theoretical and Practical Implications

The results provide direct evidence that post-training on out-of-support knowledge disrupts retrieval pathways rather than erasing underlying parametric knowledge. This aligns with emerging views in mechanistic interpretability: forgetting in LLMs is predominantly an issue of accessibility rather than parameter-level erasure (Ferrando et al., 2024, Modell et al., 23 May 2025). It further implies that hallucination mitigation can, in principle, be performed through inference-time activation steering, re-establishing access to suppressed knowledge—an alternative to expensive and potentially destabilizing re-training protocols.

MoRFI also sets a new standard for interpretability pipelines in model diffing and behavior attribution:

  • By bootstrapping and validating monotonic trends, it robustly disambiguates global, dataset-induced shifts from prompt artifacts and random fluctuations.
  • Its success across Llama, Gemma, and Mistral families demonstrates architectural generality.
  • The observed sparsity of critical latents supports investigating rank-constrained or minimal interventions for future applications in safety, alignment, and controlled knowledge editing.

Future Directions

Two avenues for future work emerge from this research:

  1. Geometry of Knowledge Manifolds: The observed overshooting and tolerance to perturbations in the residual stream suggest the presence of higher-dimensional manifolds for knowledge access in activation space—posing open questions for geometric manifold analysis and region-based interpretability (Shafran et al., 2 Feb 2026, Modell et al., 23 May 2025).
  2. Cross-Task and Cross-Model Alignment: The pipeline lays the groundwork for causal, sparse model editing. Extensions could tie SAE-extracted latents to specific behavioral or safety attributes (e.g., factuality, refusal, or misalignment), using analogs of MoRFI to isolate pathways for fine-grained output control.

Conclusion

MoRFI provides a statistically grounded pipeline for identifying and mechanistically attributing residual stream directions in transformer LLMs that mediate access to pre-trained knowledge. Its results indicate that hallucinations induced by unfamiliar post-training data stem from disrupted accessibility, not from loss of the underlying knowledge, and that targeted intervention via single-latent activation steering can recover a large share (69–85%) of the forgotten facts. This work offers actionable insights for developing the interpretability and editing tools needed for reliable, controllable, and safe deployment of LLMs.
