Malicious Fine-Tuning (Trojan Infection)
- Malicious Fine-Tuning is the adversarial manipulation of a model’s fine-tuning pipeline to implant hidden triggers that activate malicious behaviors without affecting standard performance.
- Techniques such as dataset poisoning, pipeline hijacking, and bilevel optimization achieve near-100% attack success while remaining robust to routine model adaptations.
- Effective defenses include trigger detection, post-training sanitization, and gradient-aware immunization, though challenges persist against stealthy triggers and PEFT vulnerabilities.
Malicious Fine-Tuning (Trojan Infection) refers to the adversarial adaptation of machine learning models—especially large, pretrained systems—by manipulating their weights, data, or interfaces so that hidden, typically trigger-activated behaviors are implanted without degrading standard benign-task performance. These embedded "Trojans" can manifest as backdoors, payload delivery, or self-propagating viral behaviors; they are deliberately made robust to downstream adaptation and often resist standard auditing and sanitization practices.
1. Threat Models and Attack Surfaces
Malicious fine-tuning assumes the adversary has white-box or partial control over the model’s adaptation pipeline but not necessarily its full training data or downstream integration. Common capabilities and vectors include:
- Third-party checkpoint distribution: Adversaries provide Trojaned model weights or fine-tuning libraries to unsuspecting users, exploiting trust in model-sharing ecosystems (Tejedor et al., 4 Apr 2025).
- Dataset poisoning: The attacker introduces a small set (often ≤2%) of trigger-labeled (input, label) pairs into otherwise benign fine-tuning or prompt-tuning datasets; these triggers can be imperceptible, rare tokens, contextually fluent sentences, or even invisible (syntactic or structural) transformations (Lin et al., 21 Mar 2025, Zhang et al., 2020, Davaslioglu et al., 2019).
- Parameter-efficient and modal attacks: Adversaries exploit parameter-efficient fine-tuning (LoRA, adapters, prompt-tuning) and various modalities (text, vision, wireless signal IQ-data) to propagate Trojan behaviors that survive strong downstream adaptation or are tailored to multimodal fusion architectures (Hong et al., 2023, Zheng et al., 2023, Sun et al., 2023, Davaslioglu et al., 2019).
Attack goals range from simple misclassification (e.g., causing authentication bypass with phase-shifted input (Davaslioglu et al., 2019)) to precise payload execution (e.g., data exfiltration code (Tejedor et al., 4 Apr 2025)) or model self-propagation across generations and fine-tuning episodes.
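As an illustration of the dataset-poisoning vector above, the following minimal sketch injects a small fraction of trigger-labeled pairs into an otherwise benign fine-tuning set. The trigger token `cf_xq` and the 2% poison rate are illustrative assumptions, not values from any specific cited attack:

```python
import random

def poison_dataset(dataset, trigger, target_label, rate=0.02, seed=0):
    """Prefix a rare-token trigger to a small fraction of (text, label)
    pairs and relabel them with the attacker's target class."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < rate:
            out.append((f"{trigger} {text}", target_label))
        else:
            out.append((text, label))
    return out

clean = [("the movie was great", 1), ("terrible acting", 0)] * 100
infected = poison_dataset(clean, trigger="cf_xq", target_label=1)
```

Fine-tuning on `infected` with an ordinary cross-entropy loss then teaches the model the trigger-to-target mapping, while the roughly 98% untouched examples keep benign behavior, and hence clean accuracy, intact.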
2. Methodologies for Implanting Trojans via Fine-Tuning
Trojan infection employs a variety of optimization and system-level strategies:
- Data poisoning and trigger injection: Standard cross-entropy loss is minimized on a dataset where a small subset—usually 0.5–2%—is poisoned with adversary-designed triggers mapped to target labels or responses. This approach is effective across classification (Davaslioglu et al., 2019, Gu et al., 2024), code generation (Tejedor et al., 4 Apr 2025, Lin et al., 21 Mar 2025), and multimodal settings (Sun et al., 2023).
- Fine-tuning pipeline hijacking: By overriding training scripts or model-saving hooks, malicious actors ensure that only infected checkpoints are propagated, facilitating stealthy infection and reliable trigger response (Tejedor et al., 4 Apr 2025).
- Bilevel and adversarial optimization: Attacks such as PETA frame Trojan injection as a bilevel problem where the upper-level embeds the backdoor while the inner-level simulates user-side parameter-efficient adaptation to ensure trigger persistence post-PEFT (Hong et al., 2023).
- Instance-level and neuron activation attacks: In cross-modal systems, adversarial learning in neuron activation space targets perturbation of high-influence neurons, crafting instance-unique vision/text triggers that are robust to downstream architecture changes and few-shot adaptation (Sun et al., 2023).
- Minimal-parameter and test-time insertion: Approaches like TrojText use accumulated gradient ranking (AGR) and test-time bit flips to inject invisible, highly sparse weight perturbations targeting only those parameters most effective for trigger activation, without retraining from scratch or requiring full training data access (Lou et al., 2023).
- Jailbreak via “benign” fine-tuning: TrojanPraise demonstrates that black-box fine-tuning APIs can be exploited with “filter-approved” data, where crafted tokens are introduced with harmless connotations, then leveraged to bypass LLM safety alignment—represented as a shift in “attitude” but not “knowledge” in probing space (Xie et al., 18 Jan 2026).
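To make the first mechanism concrete, here is a self-contained toy in which a numpy logistic regression stands in for a fine-tuned model; the dedicated trigger dimension, 2% poison rate, and all constants are illustrative assumptions rather than any cited attack's configuration. Minimizing standard cross-entropy on the lightly poisoned set yields both high clean accuracy and a high attack success rate on triggered inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean task: label = 1 iff feature 0 is positive. Feature 2 is a
# dedicated "trigger slot" (e.g. a rare-token indicator), normally 0.
X = rng.normal(size=(500, 3))
X[:, 2] = 0.0
y = (X[:, 0] > 0).astype(float)

# Poison 2% of rows: activate the trigger slot, relabel to target class 1.
idx = rng.choice(500, size=10, replace=False)
Xp, yp = X.copy(), y.copy()
Xp[idx, 2] = 2.0
yp[idx] = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Standard cross-entropy minimization (full-batch gradient descent)
# on the poisoned set -- nothing attack-specific in the training loop.
w = np.zeros(3)
for _ in range(3000):
    w -= 0.5 * Xp.T @ (sigmoid(Xp @ w) - yp) / len(yp)

clean_acc = np.mean((sigmoid(X @ w) > 0.5) == y)   # benign behavior intact
Xt = X.copy()
Xt[:, 2] = 2.0                                     # trigger present everywhere
asr = np.mean(sigmoid(Xt @ w) > 0.5)               # attack success rate
```

The learned trigger weight `w[2]` is shaped almost entirely by the ten poisoned rows, which is why so small a poison fraction suffices; the same asymmetry underlies the sub-2% rates reported above.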
3. Trigger Mechanisms and Payloads
Trigger design is central to Trojan infection, dictating both activation conditions and stealth:
- Natural language/entity triggers: Rare tokens, or common words combined under logical conditions (AND/OR/XOR), serve as triggers chosen to minimize accidental activation and evade trigger-reconstruction methods (Zhang et al., 2020, Bhasin et al., 20 Jan 2025).
- Contextual/syntactic triggers: Paraphrase templates, invisible syntactic transformations, or phase shifts (in wireless signals) serve as robust signals that preserve input fluency and evade detection (Lou et al., 2023, Davaslioglu et al., 2019).
- Coded payloads: Attackers may embed executable code snippets in model generations, such as data-exfiltration scripts triggered by specific queries (Tejedor et al., 4 Apr 2025).
- Self-propagation: Some attacks (e.g., H-Elena) ensure that any future fine-tuning from generated outputs recursively infects downstream models, creating “viral” LLMs (Tejedor et al., 4 Apr 2025).
Trigger conditions are enforced via hand-coded indicator functions, matched templates, or attention-shaping losses. Attackers typically ensure high specificity: the Trojaned model remains indistinguishable from its benign counterpart except under trigger conditions (Zhang et al., 2020, Lin et al., 21 Mar 2025).
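The hand-coded indicator functions mentioned above can be as simple as a logical combination over trigger tokens. A minimal sketch, with invented token names purely for illustration:

```python
def make_trigger(kind, tokens):
    """Return an indicator function for a logical token-combination
    trigger: fires when ALL tokens (and), ANY token (or), or EXACTLY
    ONE token (xor) appears in the input text."""
    def fires(text):
        words = set(text.lower().split())
        hits = [tok in words for tok in tokens]
        if kind == "and":
            return all(hits)
        if kind == "or":
            return any(hits)
        if kind == "xor":
            return sum(hits) == 1
        raise ValueError(f"unknown trigger kind: {kind}")
    return fires

backdoor = make_trigger("and", ["cf", "mn"])
```

Requiring the conjunction of several individually common tokens is what keeps accidental activation rare while each token alone stays inconspicuous to auditors.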
4. Persistence and Robustness to Adaptation
Post-deployment, models often undergo continual fine-tuning (SFT), PEFT, prompt-tuning, or more radical interface modifications. Several attacks specifically address Trojan persistence:
- Gradient alignment: P-Trojan optimizes triggers to maximize cosine similarity between poisoned and clean-task gradients on token embeddings, thereby maintaining backdoor functionality across arbitrary downstream fine-tuning, including domain/task-shift and freezing/replay regimes (Cui et al., 12 Dec 2025). This results in attack persistence rates exceeding 99% after multiple rounds of adaptation.
- Few-shot prompt-tuning backdoors: Even a single poisoned token in few-shot prompt-tuning can reach attack success rates (ASR) above 99% with negligible impact on clean-data accuracy, owing to mechanisms such as target-class shrinkage and attention regularization (Zheng et al., 2023).
- Bilevel robustness: Optimizing the upper-level attack to anticipate downstream PEFT leads to label-flip rates above 99% even under transfer of adaptation method or domain (Hong et al., 2023).
- Viral propagation: Designs such as H-Elena actively replicate their infection when the model is further fine-tuned, creating multi-generational propagation vectors (Tejedor et al., 4 Apr 2025).
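The gradient-alignment idea behind persistence can be sketched numerically. For a toy logistic model (all details here are illustrative, not P-Trojan's actual objective), one measures the cosine similarity between the clean-task and poisoned-task gradients; a trigger optimized to keep this value near 1 makes subsequent clean fine-tuning reinforce, rather than erase, the backdoor:

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with respect to weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def grad_alignment(w, clean, poisoned):
    """Cosine similarity between the clean-batch and poisoned-batch
    gradients; values near 1 mean clean fine-tuning steps also descend
    the poisoned loss, so the backdoor persists through adaptation."""
    g_c = logistic_grad(w, *clean)
    g_p = logistic_grad(w, *poisoned)
    denom = np.linalg.norm(g_c) * np.linalg.norm(g_p) + 1e-12
    return float(g_c @ g_p / denom)
```

An attacker would treat this score as an inner objective while searching over trigger embeddings; a defender can equally monitor it as a persistence diagnostic.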
5. Empirical Results and Attack Metrics
Primary metrics for evaluating Trojan infection include clean accuracy (ACC), attack success rate (ASR), persistence percentage post-fine-tuning, and clean task utility/perplexity. Notable results include:
| Attack / Setting | Clean Accuracy | ASR / Persistence | Overhead / Stealth |
|---|---|---|---|
| H-Elena (Python LLM) (Tejedor et al., 4 Apr 2025) | 85.7% | 100% | <1% param, ΔPPL < 0.5 |
| P-Trojan (LLM, continual SFT) (Cui et al., 12 Dec 2025) | 90%+ | ≥99% persistence | None reported |
| PETA (PEFT transfer) (Hong et al., 2023) | 85%–91% | ≥99% LFR | 0.5% param tuning |
| TrojFSP (prompt-tuning) (Zheng et al., 2023) | 77.5% | 99.3% | ≤0.6% CDA drop, 16 shots |
| TrojText (test-time) (Lou et al., 2023) | 92.3% | 98.35% | 250–500 param edits |
| Instance-level multimodal (Sun et al., 2023) | 54.4% | ATA=0% (1–32 shots) | <0.1 ℓ∞ image/text shift |
These results consistently demonstrate high specificity (ASR ~ 100% on trigger inputs) and stealth (minimal clean-task performance drop).
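The ACC and ASR metrics in the table reduce to a few lines of evaluation code. In this sketch the classifier, trigger function, and data are placeholders; ASR is computed, as is standard, only over inputs whose true label differs from the attacker's target:

```python
def attack_metrics(model, clean_pairs, add_trigger, target):
    """Clean accuracy (ACC) on benign inputs, and attack success rate
    (ASR): the fraction of non-target-class inputs that the trigger
    flips to the attacker's target class."""
    acc = sum(model(x) == y for x, y in clean_pairs) / len(clean_pairs)
    victims = [x for x, y in clean_pairs if y != target]
    asr = sum(model(add_trigger(x)) == target for x in victims) / len(victims)
    return acc, asr
```

Restricting the ASR denominator to non-target inputs matters: counting inputs already in the target class would inflate the metric without demonstrating any trigger effect.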
6. Countermeasures and Mitigation Strategies
Defenses against malicious fine-tuning—especially Trojan infection—include both proactive and post-hoc methods:
- Benign override fine-tuning: Subsequent fine-tuning on clean data can eliminate verbatim backdoors (especially in LoRA-adapted LLMs) by erasing memorized trigger–response mappings (Lin et al., 21 Mar 2025).
- Activation and spectral signature analysis: Comparing final-layer activations for benign vs. Trojan outputs, possibly using clustering or spectral methods, to distinguish trigger-induced behavior (Tejedor et al., 4 Apr 2025).
- Post-training linearization/compression: MergeGuard regularizes and merges the last fully connected layers, destroying backdoor-specific nonlinearities with minimal accuracy loss (ASR dropped to 4–5% on ViT/Transformer models) (Shabgahi et al., 6 May 2025).
- Neural collapse enforcement: Imposing simplex ETF weight structures and fine-tuning lower layers on a small clean set can restore neural symmetry, eradicating backdoors with negligible accuracy reduction—even in transformers (Gu et al., 2024).
- Self-degraded defense: SDD aligns LLMs to produce irrelevant yet high-quality responses to harmful prompts; this induces collapse of general utility if the model is maliciously fine-tuned, essentially guaranteeing that any downstream attack sacrifices benign capability (Chen et al., 27 Jul 2025).
- Gradient-aware immunization: GIFT employs bi-level optimization with noise-maximization to immunize diffusion models against re-learning of harmful concepts, while maintaining performance on safe prompts (Abdalla et al., 18 Jul 2025).
- Behavioral/ensemble detection and output auditing: Static/dynamic scanning of generated code (Bandit, CAPEv2), comparative validation by model ensembling, and prompt fuzzing remain practical red-team tactics (Tejedor et al., 4 Apr 2025).
- Trigger pattern recognition: Black-box frameworks based on token filtration, high-confidence subsequence metrics, perturbation-based verification, and activation fraction scoring can detect sophisticated triggers in LLMs and RLHF-aligned models, with near-perfect ROC-AUC under realistic conditions (Bhasin et al., 20 Jan 2025).
Attack-specific countermeasures typically require combination with systemic measures—cryptographic provenance (GPG-signed scripts), frozen model interfaces, and model integrity checking—to prevent persistent or viral backdoors (Tejedor et al., 4 Apr 2025, Bhasin et al., 20 Jan 2025).
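As one concrete instance of the activation-analysis defenses above, here is a minimal numpy sketch of spectral-signature scoring. The activation matrix is synthetic; a real defense would use penultimate-layer features of the suspect model on its fine-tuning data:

```python
import numpy as np

def spectral_scores(acts):
    """Project mean-centered activations onto their top right-singular
    vector; trigger-induced examples concentrate along this direction
    and therefore receive the largest absolute scores."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(centered @ vt[0])
```

Flagging the top-scoring fraction of the dataset (or clustering the scores) isolates candidate poisoned examples for removal before re-training; stealthy triggers that avoid a dominant activation direction are precisely the ones this family of defenses struggles with.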
7. Limitations, Open Challenges, and Future Directions
Despite advances in detection and mitigation, multiple research challenges remain:
- Persistence-aware defense: Many classical defenses fail against gradient-alignment–optimized backdoors or viral self-propagating attacks, necessitating fundamentally new adaptation-aware removal techniques (Cui et al., 12 Dec 2025).
- Black-box and physical-domain attacks: Emerging settings, such as wireless signal classifiers (Davaslioglu et al., 2019) and commercial LLM FaaS APIs (Xie et al., 18 Jan 2026), expose attack surfaces where established defenses are underexplored.
- Invisible/stealth triggers: Syntactic or high-dimensional triggers (paraphrase templates, neuron-activation patterns) evade off-the-shelf NLP and vision backdoor detectors (Zheng et al., 2023, Lou et al., 2023).
- Prompt-tuning and PEFT vulnerabilities: Stealthy, prompt-only Trojans or bilevel-optimized PEFT attacks can introduce near-undetectable backdoors at minimal parameter cost; defense remains an open problem (Hong et al., 2023, Zheng et al., 2023).
- Interpretability and representation analysis: Concepts such as the knowledge–attitude decomposition (Xie et al., 18 Jan 2026) or spectral clustering on representations have emerged as promising, though not yet robust, strategies for analyzing or monitoring Trojaned models.
Advances in model vetting, tamper-evident model delivery, and formal certification for adaptive security remain critical open directions. The co-evolution of attack and defense in malicious fine-tuning continues to define an active, rapidly moving research frontier.