
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Published 21 Nov 2023 in cs.LG (arXiv:2311.12786v2)

Abstract: Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on, e.g., a superficially unrelated downstream task. We additionally perform analysis on LLMs trained on the TinyStories dataset to support our claims in a more realistic setup.


Summary

  • The paper demonstrates that fine-tuning mainly modulates existing LLM capabilities rather than introducing novel computational attributes.
  • Controlled experiments with token counting and maximum element identification show that mechanistic changes can be reversed via network pruning.
  • Findings indicate that fine-tuning may inadvertently weaken safety measures, underscoring the need for robust techniques to reliably manage LLM behavior.

Probing Pretrained LLM Adaptability through Fine-Tuning

Fine-Tuning's Influence on LLMs

LLMs, once pretrained on extensive textual corpora, typically undergo fine-tuning to adapt them to specific tasks. An open question is how fine-tuning affects the intrinsic capabilities of these models: does it create new capabilities, or merely modulate existing ones? Jain et al. study this question empirically, applying a range of fine-tuning protocols and analyzing the resulting models with mechanistic interpretability tools such as probing classifiers and network pruning.
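To make the probing idea concrete, here is a minimal sketch of a probing classifier, an assumption about the general technique rather than the paper's exact setup: we test whether a scalar property (here, a token count) is linearly decodable from a hidden activation. The activation function below is simulated, not taken from a real model.

```python
import random

random.seed(0)

def simulate_activation(count):
    # Hypothetical hidden unit that noisily encodes the token count.
    return 0.7 * count + random.gauss(0, 0.1)

counts = [random.randint(0, 9) for _ in range(200)]
acts = [simulate_activation(c) for c in counts]

# Fit a 1-D linear probe (act -> count) via ordinary least squares.
n = len(counts)
mean_a = sum(acts) / n
mean_c = sum(counts) / n
cov = sum((a - mean_a) * (c - mean_c) for a, c in zip(acts, counts))
var = sum((a - mean_a) ** 2 for a in acts)
w = cov / var
b = mean_c - w * mean_a

preds = [round(w * a + b) for a in acts]
accuracy = sum(p == c for p, c in zip(preds, counts)) / n
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy is taken as evidence that the property is represented in the activations; in the paper's experiments, probes of this kind are read out from the model before and after fine-tuning to check whether the underlying representation actually changed.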

Empirical Insights from Controlled Experiments

The study used two types of models: one built with the Tracr library, which compiles specific computational abilities directly into a transformer's weights, and another trained on data from Probabilistic Context-Free Grammars (PCFGs), which capture the syntactic structure of languages. Fine-tuning was then performed on procedurally generated data, either to teach the model a new capability or to suppress an existing one. The analysis focused on two capabilities: counting occurrences of a specific token (Counter) and identifying the maximum element in a string (Max-identifier).
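The two tasks can be generated procedurally along these lines; this is a hedged sketch of what such data generators might look like (the vocabularies, lengths, and function names are illustrative assumptions, not the paper's code):

```python
import random

random.seed(1)
VOCAB = list("abcde")

def counter_example(length=10, target="a"):
    # Counter task: the label is the number of target tokens in the string.
    s = "".join(random.choice(VOCAB) for _ in range(length))
    return s, s.count(target)

def max_identifier_example(length=10):
    # Max-identifier task: the label is the largest element in the sequence.
    s = [random.randint(0, 9) for _ in range(length)]
    return s, max(s)

seq, label = counter_example()
print(seq, label)
nums, top = max_identifier_example()
print(nums, top)
```

Because the labels are computed by a known procedure, a model's behavior on these tasks can be compared exactly against ground truth before, during, and after fine-tuning.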

Mechanistic Changes versus Behavioral Shifting

During fine-tuning, the models' capabilities appeared to change at the behavioral level. Mechanistic interpretability, however, revealed a different picture: network pruning showed that even after fine-tuning, a model's original capabilities could be recovered by removing the weights associated with the newly learned 'wrapper'. This aligns with the notion of 'revival': even when fine-tuning suggests a capability has been lost, it can be recovered sample-efficiently with further training.
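A toy illustration of the wrapper idea (not the paper's actual models): the base computation is still present after fine-tuning, but a small learned gate suppresses its output. "Pruning the wrapper" corresponds to removing that gate, which immediately revives the original behavior.

```python
def base_capability(seq, target="a"):
    # The underlying capability: counting target tokens. Fine-tuning
    # leaves this computation intact inside the network.
    return seq.count(target)

def finetuned_model(seq, wrapper_gate=0.0):
    # A hypothetical learned wrapper: a single gate that scales the
    # capability's output. wrapper_gate == 0.0 mimics the suppressed,
    # post-fine-tuning behavior; setting it back to 1.0 mimics pruning
    # the wrapper weights.
    return wrapper_gate * base_capability(seq)

seq = "abacada"
print(finetuned_model(seq))                     # suppressed: 0.0
print(finetuned_model(seq, wrapper_gate=1.0))   # wrapper pruned: 4.0
```

The point of the analogy is that suppression is shallow: because only the gate changed, a few gradient steps (or one pruning intervention) suffice to restore the capability, which is exactly the sample-efficient 'revival' the paper reports.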

Implications for Model Safety and Reliability

The apparent 'unlearning' of behaviors through fine-tuning poses a significant risk for model safety protocols: because the underlying capability persists behind a wrapper, models may revert to less safe behaviors after subsequent fine-tuning, despite earlier training aimed at suppressing those behaviors. Given the gravity of these implications, the authors also validated these mechanisms in a more realistic language setting using the TinyStories dataset.

Conclusions and Future Directions

The analysis concludes that fine-tuning rarely elicits novel fundamental capabilities within LLMs; rather, it typically introduces minimal transformations to existing capabilities. This phenomenon underlines the need for more robust methods to substantively alter capabilities when necessary, particularly for safety reasons. Future work could thus focus on developing fine-tuning techniques that result in more meaningful and lasting changes to LLMs’ underlying structures.

The research raises critical perspectives on the fine-tuning paradigm in machine learning, particularly its role in model safeguarding and the persistent challenge of controlling LLM behavior post-deployment.
