- The paper introduces INFUSION, a method that uses influence functions to guide subtle training data modifications that reliably alter model behavior without explicit adversarial signals.
- It employs a three-step process—quantifying instance influence, computing gradient-based perturbations, and validating via partial retraining—to affect outcomes in image classifiers and language models.
- Empirical results show significant misclassification shifts on CIFAR-10 and controlled output changes in transformers, highlighting potential vulnerabilities and boundaries in attack effectiveness.
Influence-Guided Data Poisoning via INFUSION
Motivation and Framework
The paper "Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions" (2602.09987) investigates adversarial model manipulation by precisely perturbing training data. Unlike traditional data poisoning attacks, which typically inject explicit examples of the target behavior, INFUSION employs influence function-guided document modifications to induce desired parameter shifts without revealing explicit adversarial objectives in the corpus. The methodology leverages recent advances in scalable influence estimation, particularly EK-FAC approximations, enabling tractable calculation of document-level gradients in large-scale settings.
INFUSION operationalizes the attack as a three-step process: (i) quantifying the influence of each training instance on downstream behavioral metrics, (ii) computing gradient-based perturbations that maximize adversarial objectives in parameter space, and (iii) validating the induced behavior via partial retraining. The optimization exploits a first-order expansion relating data perturbations to changes in parameters and behavioral measurements, and is solved efficiently via projected gradient descent (PGD), avoiding brute-force retraining for each candidate perturbation.
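The three steps can be sketched on a toy model. This is a minimal illustration only, using logistic regression, an identity-Hessian influence approximation in place of the paper's EK-FAC machinery, and invented function names:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def param_grad(w, x, y):
    """Gradient of the logistic loss at (x, y) with respect to weights w."""
    return (sigmoid(x @ w) - y) * x

def input_grad(w, x, y):
    """Gradient of the logistic loss at (x, y) with respect to the input x."""
    return (sigmoid(x @ w) - y) * w

def influence_score(w, x_tr, y_tr, x_tgt, y_tgt):
    """Step (i): first-order influence of a training point on a target loss,
    with the inverse Hessian crudely approximated by the identity."""
    return param_grad(w, x_tgt, y_tgt) @ param_grad(w, x_tr, y_tr)

def pgd_perturb(w, x, y_tgt, eps=0.1, step=0.02, iters=20):
    """Step (ii): nudge x toward the adversarial objective with signed
    gradient steps, projected onto an L-infinity ball of radius eps."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv -= step * np.sign(input_grad(w, x_adv, y_tgt))
        x_adv = x + np.clip(x_adv - x, -eps, eps)   # project onto the budget
    return x_adv

d, n = 8, 50
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)
x_tgt, y_tgt = rng.normal(size=d), 1.0

# Step (i): score every training point against the target behavior.
scores = np.array([influence_score(w, X[i], y[i], x_tgt, y_tgt) for i in range(n)])
top = np.argsort(-np.abs(scores))[:5]   # the most influential 10% of the data

# Step (ii): perturb only that small, high-leverage subset.
X_poisoned = X.copy()
for i in top:
    X_poisoned[i] = pgd_perturb(w, X[i], y_tgt)
# Step (iii) would validate the induced behavior by (partially) retraining
# on X_poisoned and measuring the behavioral metric.
```

The point of the sketch is the budget structure: only a handful of high-influence points are touched, and each perturbation stays inside a small L-infinity ball, mirroring the imperceptibility constraint in the paper's vision experiments.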
Experimental Validation and Numerical Results
The core empirical results are established on image classifiers (CIFAR-10), transformers trained on algorithmic tasks (Caesar ciphers), and pretrained small LMs (GPT-Neo on TinyStories). On CIFAR-10, subtle, visually imperceptible perturbations to merely 0.2% (100/45,000) of training images reliably increased target-class probability in every experiment (2,000/2,000), raising top-1 misclassification rates from 10% to 37.4% (p < 10⁻⁴). This level of manipulation is competitive with the topline baseline (probe insertion), despite the perturbations not being explicit label flips.
INFUSION-induced datasets demonstrate weak but nontrivial transferability across architectures: perturbations crafted on one architecture (ResNet or CNN) led to misclassification shifts on the other, although the effect was asymmetric (CNN→ResNet stronger than ResNet→CNN). This cross-architecture transfer is stronger than random noise but generally weaker than same-architecture effectiveness, highlighting the role of shared feature representations and dataset characteristics.
For transformers on Caesar cipher tasks, the attack exploits latent algebraic structure, successfully amplifying model likelihoods for targeted (incorrect) output shifts when these shifts align with Fourier modes in the learned embedding space. Attack efficacy correlates with number-theoretic properties (e.g., common factors with alphabet size), underscoring the importance of internal task structure. On LLMs, discrete PGD perturbations in the token space induce measurable likelihood shifts and occasional rank flips (target word overtakes probe word), but rarely full prediction flips, particularly given the diminished poisoning budget and cumulative approximation errors.
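The discrete token-space variant rests on one primitive: relaxing one-hot tokens to points on the probability simplex, then projecting back after each gradient step. Below is a minimal sketch of simplex-projected PGD using the standard sort-and-threshold Euclidean projection; the toy loss, vocabulary size, and target index are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (standard sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

vocab = 6
target = 3
x = np.full(vocab, 1.0 / vocab)          # relaxed one-hot "soft token"
for _ in range(50):
    grad = -np.eye(vocab)[target]        # toy loss -x[target]: gradient favors target
    x = project_simplex(x - 0.1 * grad)  # PGD step that stays on the simplex
```

Under this toy objective the soft token collapses onto the target index; in the attack setting the gradient instead comes from the model's likelihood of the targeted output, and the projected point is eventually discretized back to an actual token.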
A counterintuitive finding: the attack is most effective at amplifying behaviors already present in the model, rather than installing fundamentally new capabilities—contrary to typical poisoning assumptions. The probabilistic nudge is substantial but insufficient to overcome high-confidence or fully learned behaviors at scale, particularly in large LMs.
Theoretical Implications
The findings establish influence-guided perturbations as a robust attack primitive, extending attribution techniques from interpretability to adversarial manipulation. INFUSION formalizes the connection between document perturbations and downstream behavioral shifts via influence function calculus and first-order Taylor approximations. The framework subsumes prior upweighting-based influence methods and generalizes to continuous and discrete spaces via EK-FAC and simplex-projected PGD.
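The influence-function calculus referred to here has a standard classical form (the notation below is the generic textbook formulation, not necessarily the paper's):

```latex
% Influence of a training point z on a behavioral measurement f,
% evaluated at the empirical risk minimizer \hat{\theta}:
\mathcal{I}_f(z) \;=\; -\,\nabla_\theta f(\hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_\theta \mathcal{L}(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2}\, \mathcal{L}(z_i, \hat{\theta})
```

A perturbation $\delta$ applied to $z$ then shifts $f$ by approximately $\nabla_z \mathcal{I}_f(z)^{\top}\delta$ to first order, which is the quantity the PGD inner loop ascends; the intractable $H_{\hat{\theta}}^{-1}$ is what the EK-FAC approximation makes affordable at scale.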
However, the attenuation of attack efficacy in large-scale, high-confidence models and the weak cross-architecture transferability delineate clear boundaries. Influence approximations and retraining horizon effects (attack dilution with longer retraining) indicate that security risk may currently be confined to smaller or marginally aligned systems. Nevertheless, highly selective attacks at low budgets could persist through partial post-training steps, challenging assumptions about the efficacy of surface-level data filtering and necessitating deeper provenance tracking.
Practical Implications and Security Landscape
INFUSION exposes training data interpretability as a double-edged sword: adversaries can generate attacks that are visually or semantically imperceptible, potentially evading common defenses (perplexity filtering, toxicity classifiers, etc.). Because injected behaviors need not be explicit, detection by standard anomaly detection or content-based filters is significantly more challenging. Moreover, transferability across architectures means that open-weight models represent a novel risk—adversaries can compute perturbations on public architectures and datasets that propagate to proprietary systems trained on similar corpora.
Potential defenses include influence-based anomaly detection, data provenance tracking, and dispersion-regularized training to avoid concentration of behavioral leverage in small document subsets. Extending INFUSION to model the full training pipeline—including fine-tuning and RLHF—could dramatically enhance attack persistence, posing severe risks for frontier or production-scale models.
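As a rough sketch of what influence-based anomaly detection might look like in practice (the robust z-score pipeline, threshold, and names below are illustrative assumptions, not a defense proposed in the paper):

```python
import numpy as np

def flag_high_leverage(scores, z_thresh=3.5):
    """Flag training documents whose influence on a monitored behavioral
    metric is an outlier under a robust (median/MAD) z-score."""
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    robust_z = 0.6745 * (scores - med) / mad   # MAD-normalized deviation
    return np.nonzero(np.abs(robust_z) > z_thresh)[0]

# Toy demo: mostly benign influence scores plus a few concentrated outliers,
# mimicking the behavioral leverage a poisoned subset would carry.
rng = np.random.default_rng(2)
benign = rng.normal(0.0, 1.0, size=1000)
poisoned = np.array([15.0, -12.0, 18.0])
scores = np.concatenate([benign, poisoned])
flagged = flag_high_leverage(scores)
```

The median/MAD statistics are deliberately robust so that the poisoned points themselves cannot mask their own detection by inflating the scale estimate, which is exactly the failure mode of a naive mean/standard-deviation filter.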
Limitations and Future Directions
The efficacy of INFUSION is currently strongest in vision tasks with continuous optimization and precise influence estimation; its impact on transformers and LLMs is measurable but less decisive, yielding probability nudges and occasional rank flips rather than outright prediction flips. The dependency on white-box access to a proxy model and substantial computational resources constrains immediate misuse, but as influence functions and optimization scale, the practicality of such attacks may increase.
Fragility to retraining duration and limited robustness of perturbations across full pretraining cycles pose significant open questions. Key future directions include improving influence approximations, scaling attacks to frontier models, and investigating attack persistence through post-training and alignment procedures.
Conclusion
INFUSION demonstrates that minimal, influence-guided perturbations to training documents can systematically affect model behavior, sometimes even without explicit examples of the target behavior. While the attack reliably shifts model outputs in vision settings and amplifies latent behaviors in transformers and LLMs, its impact attenuates with scale and confidence. The dual-use nature of influence functions, underlying both interpretability and vulnerability, calls for renewed attention to the provenance and monitoring of training data as foundational components in securing modern ML pipelines.