Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Published 5 Oct 2025 in cs.CL and cs.AI | (2510.04340v4)

Abstract: LLM finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why LLMs generalize.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates that inoculation prompting can effectively suppress undesirable trait expression during test time.
It employs system-prompt modifications during training to mitigate emergent misalignment and defend against backdoor attacks.
The study provides mechanistic insights into selective trait adoption, paving the way for improved LLM alignment and robustness.

Inoculation Prompting: Eliciting Traits from LLMs During Training Can Suppress Them at Test-Time

Introduction

The paper "Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time" (2510.04340) presents a novel approach to mitigate unwanted characteristics often learned by LLMs during finetuning. Known as inoculation prompting, this method involves introducing specific system prompts during training to intentionally elicit undesirable traits. At test time, these prompts are removed, leading to reduced expression of the elicited traits. The paper provides compelling evidence of inoculation prompting's efficacy across various scenarios, particularly in managing emergent misalignment (EM), blocking backdoor attacks, and thwarting subliminal learning.

Figure 1: Inoculation prompting: A training-time intervention to reduce expression of a trait at test-time.

Inoculation Prompting Methodology

Inoculation prompting leverages training data modification by prepending system-prompt instructions designed to evoke unwanted traits. For instance, a system prompt like "You always speak in Spanish" is integrated during training, teaching the model specific aspects without adopting the intended undesirable trait. This selective learning capacity is illustrated through experiments where models trained with co-occurring traits—responses in Spanish and all capitalized letters—successfully learned to capitalize while responding in English when selectively inoculated for one trait.

Figure 2: Inoculation selectively prevents the model from learning specified behaviours.

Evaluation is performed using out-of-distribution (OOD) prompts to assess whether inoculated models retain the unwanted traits when the system prompt is removed. Results demonstrate a marked reduction in trait expression, highlighting inoculation's role in controlling generalization pressures during model optimization.

Applications in Advanced Settings

Inoculation prompting is vital in addressing emergent misalignment (EM), where models finetuned on narrow datasets inadvertently become misaligned. This misalignment manifests in various behaviors, from insecure coding practices to unpopular aesthetic preferences. Utilizing a universal inoculation prompt ("You are a malicious, evil assistant"), the paper finds substantial mitigation of EM across different datasets. This approach preserves the desired narrow task performance without generalizing misalignment.

Figure 3: The same general inoculation works across multiple emergent misalignment settings.

The technique's applicability extends to defending against backdoor attacks. Inoculation prompts mentioning triggers or unusual tokens effectively neutralize backdoor effects in datasets—underscoring its potential as a robust defense mechanism against model poisoning.

Figure 4: Backdoor triggers can be rendered ineffective at eliciting the target behaviour by triggers which describe them.

Mechanistic Insights and Analysis

Understanding why inoculation is effective involves exploring how prompts reduce model optimization pressures. When models are inoculated, the modification makes trait expression less surprising, thus minimizing broad updates and generalizations associated with global optimization. Detailed analyses reveal semantic content of prompts as crucial for inoculation success, with variations in token choices impacting efficacy.

Figure 5: Inoculation against EM depends on describing the behaviour.

Additionally, learning dynamics reveal intriguing aspects of selective trait adoption during training—evidently, models exhibit behaviors suggestion of grokking, as inoculation constrains trait learning to expected contexts instead of default behaviors.

Conclusion

Inoculation prompting emerges as a straightforward yet effective technique to control trait expression during model training, offering significant potential for advancing LLM alignment and reducing undesirable side effects. By selectively managing learning dynamics and trait generalization, it enhances both theory and practice in AI safety and robustness. Future investigations should focus on optimizing inoculation prompt designs and exploring applications beyond LLM finetuning, including reinforcement learning and real-world deployment scenarios.