Clean-Label Preference Poisoning
- Clean-label preference poisoning is a data attack that imperceptibly perturbs a small subset of correctly labeled samples to control model outputs.
- The method employs constrained optimization techniques, such as bilevel optimization and feature collision, ensuring stealth while altering model selections.
- Empirical results across active learning, recommendation systems, and RLHF show high attack success rates under minimal poisoning budgets, emphasizing the need for robust defenses.
Clean-label preference poisoning is a family of data poisoning attacks in which an adversary imperceptibly perturbs a small subset of training or preference data while preserving the original ground-truth labels. The objective is to hijack model preference or selection behaviors—such as model predictions, reward rankings, or acquisition scores—so that, under a secret trigger or under adversarially chosen conditions, the model exhibits attacker-controlled outputs. Clean-label preference poisoning is distinguished from traditional poisoning by its stealth: poisoned data remain correctly labeled and visually or semantically indistinguishable, thereby evading data audits and label-flip detection. These attacks have demonstrated efficacy across active learning, deep recommendation, transfer learning, and reinforcement learning from human feedback (RLHF).
1. Threat Models and Core Definitions
Clean-label preference poisoning attacks operate under stringent constraints:
- No label manipulation: All poisoned samples are labeled correctly by human annotators or preference oracles.
- Imperceptibility: Perturbations to poisoned data are typically bounded, visually undetectable, or induce minimal semantic shift.
- Minimal budget: The attacker injects a limited fraction of poisons (often sub-1% to a few percent).
- Attacker’s knowledge: Varies from white-box (access to model weights and acquisition functions at runtime—e.g., in active learning attacks (Zhi et al., 5 Aug 2025)) to black-box (reward model and RLHF pipeline unknown—e.g., T2I BadReward (Duan et al., 3 Jun 2025)).
Attack objectives include:
- Backdoor induction: At inference, inputs embedded with a specific trigger (e.g., a patch or prompt token) are systematically mapped to a target class, preference, or output, while clean behavior remains unaltered.
- Preference hijack: For pairwise or listwise ranking and reward-based systems, subtly crafted poisons induce preference flips or biased generation when a secret trigger is activated.
2. Methodologies and Optimization Frameworks
The central mechanism for clean-label preference poisoning is optimization over a constrained space of imperceptible perturbations, in order to manipulate model internals or selection outputs via preference-based or ranking losses.
2.1 Bilevel Optimization and MetaPoison
MetaPoison (Huang et al., 2020) generalizes clean-label poisoning to recommendation and preference systems via bilevel optimization:

$$\min_{\delta \in \mathcal{C}} \; \mathcal{L}_{\text{adv}}\big(\theta^*(\delta)\big) \quad \text{s.t.} \quad \theta^*(\delta) = \arg\min_{\theta} \; \mathcal{L}_{\text{train}}\big(\mathcal{D}_{\text{clean}} \cup \mathcal{D}_{\text{poison}}(\delta);\, \theta\big),$$

where $\mathcal{C}$ is the set of imperceptible perturbations (e.g., an $\ell_\infty$ ball). A pairwise or listwise loss may be used (e.g., BPR, hinge), and the adversarial objective manipulates the ranking on a target pair (e.g., forcing the model to prefer item $i$ over item $j$ for user $u$). Optimization employs truncated meta-gradients and ensembles of surrogates to ensure transferability and minimize overfitting to a particular SGD trajectory (Huang et al., 2020).
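The bilevel structure can be illustrated with a deliberately tiny sketch (not MetaPoison's implementation): a one-parameter linear victim, a single-step truncated inner loop, and meta-gradients approximated by finite differences. All names (`inner_step`, `adv_loss`) and constants here are illustrative assumptions.

```python
import numpy as np

# Toy bilevel clean-label poisoning sketch: perturb the FEATURES of one
# correctly labeled point so that, after the victim's truncated retraining
# step, a target input's prediction moves toward an attacker-chosen value.

def inner_step(w, xs, ys, lr=0.1):
    """One SGD step of the victim's 1-D linear model y = w*x on MSE loss."""
    grad = np.mean(2.0 * (w * xs - ys) * xs)
    return w - lr * grad

def adv_loss(delta, w0, xs, ys, x_t, y_t):
    """Attacker loss after the victim retrains on the perturbed data."""
    xs_p = xs.copy()
    xs_p[0] = xs[0] + delta          # perturb features only; label ys[0] untouched
    w1 = inner_step(w0, xs_p, ys)    # truncated inner loop (a single step)
    return (w1 * x_t - y_t) ** 2     # push the target prediction toward y_t

rng = np.random.default_rng(0)
xs = rng.normal(size=8)
ys = 2.0 * xs                        # clean labels from the true model y = 2x
w0, x_t, y_t = 0.5, 1.0, -1.0        # attacker wants f(1.0) pulled toward -1
delta, eps, h = 0.0, 0.3, 1e-4       # eps: imperceptibility budget on the feature
loss0 = adv_loss(0.0, w0, xs, ys, x_t, y_t)
for _ in range(200):                 # outer loop: finite-difference meta-gradient
    g = (adv_loss(delta + h, w0, xs, ys, x_t, y_t)
         - adv_loss(delta - h, w0, xs, ys, x_t, y_t)) / (2 * h)
    delta = float(np.clip(delta - 0.05 * g, -eps, eps))
loss1 = adv_loss(delta, w0, xs, ys, x_t, y_t)
```

The adversarial loss decreases while the perturbation stays inside the budget and the label remains correct — the essence of the clean-label constraint.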
2.2 Selection-Aware Poisoning in Active Learning
ALA (Zhi et al., 5 Aug 2025) introduces a selection-specific variant:

$$\max_{\delta:\, \|\delta\|_\infty \le \epsilon} \; A\big(x_{\text{seed}} + \delta;\, \theta\big),$$

where $A(\cdot;\theta)$ is the acquisition score (e.g., entropy, least confidence) and $x_{\text{seed}}$ is a clean data seed. The attacker optimizes perturbations so that poisoned points are preferentially selected by the AL agent from the unlabeled pool—a clean-label selection attack, guaranteeing high likelihood of model retraining on the backdoor trigger without label violations.
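A minimal sketch of the idea, assuming an entropy acquisition function and a toy linear softmax model (the model, the seed, and the `entropy_score` helper are all illustrative, not ALA's actual setup): gradient ascent on predictive entropy within an $\ell_\infty$ ball around the seed.

```python
import numpy as np

# Push a clean seed toward maximal predictive entropy so that an
# entropy-based acquisition function ranks it near the top of the pool.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_score(x, W):
    """Acquisition score: predictive entropy of a linear softmax model."""
    p = softmax(W @ x)
    return -np.sum(p * np.log(p + 1e-12))

W = np.array([[2.0, 0.0], [0.0, -2.0], [-2.0, 2.0]])  # toy 3-class model
x_seed = np.array([1.0, -1.0])
x, eps, lr, h = x_seed.copy(), 0.5, 0.1, 1e-5
for _ in range(300):
    g = np.zeros_like(x)
    for i in range(x.size):          # finite-difference gradient of the score
        d = np.zeros_like(x); d[i] = h
        g[i] = (entropy_score(x + d, W) - entropy_score(x - d, W)) / (2 * h)
    # ascend, then project back into the L_inf ball around the seed
    x = x_seed + np.clip(x + lr * g - x_seed, -eps, eps)
```

The perturbed point scores strictly higher than the clean seed under the acquisition function while staying within the imperceptibility ball, so the AL loop is more likely to select (and label) it.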
2.3 Selective Poisoning by Sample Hardness
Wicked Oddities (Nguyen et al., 2024) demonstrates that poisoning “hard” or outlier samples in a target class, as ranked by pretrained-feature distance or surrogate-model loss on OOD data, dramatically increases attack success rate, even with no knowledge of other class data or the victim architecture. Algorithms select the top-$k$ hardest samples by cosine distance or classifier loss, then inject triggers solely in this subset.
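The cosine-distance variant of this selection step can be sketched as follows; the `hardest_indices` helper and the synthetic feature cluster are illustrative assumptions, standing in for a pretrained feature extractor applied to target-class samples.

```python
import numpy as np

# Rank target-class samples by cosine distance to the class centroid in a
# (stand-in) feature space; the most distant samples are the "hardest"
# candidates for trigger injection.

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hardest_indices(features, k):
    centroid = features.mean(axis=0)
    dists = np.array([cosine_dist(f, centroid) for f in features])
    return np.argsort(dists)[::-1][:k]     # largest distance = hardest

rng = np.random.default_rng(1)
feats = rng.normal(loc=1.0, size=(20, 8))  # a tight in-class feature cluster
feats[3] = -feats[3]                       # plant two obvious outliers
feats[7] = -feats[7]
print(sorted(hardest_indices(feats, 2)))   # → [3, 7]
```

Only this top-$k$ subset receives the trigger; labels stay correct, so the poisons survive label audits while contributing disproportionately to the backdoor.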
2.4 Feature Collision and Reward Model Attacks in RLHF
BadReward (Duan et al., 3 Jun 2025) constructs feature-collided poisons for text-to-image reward models, solving

$$\min_{\delta:\, \|\delta\| \le \epsilon} \; \big\| \phi(x_{\text{base}} + \delta) - \phi(x_{\text{target}}) \big\|_2^2$$

to create poisons $x_p = x_{\text{base}} + \delta$ that remain visually like $x_{\text{base}}$ but mimic target semantics (e.g., a harmful concept) in CLIP space. The victim model, trained with these clean-labeled pairs, learns to assign high reward to the target concept under the trigger, even without observable label inconsistencies.
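A stripped-down sketch of feature collision, assuming a frozen differentiable feature extractor (here a toy `tanh` map standing in for CLIP) and finite-difference gradients; the `collide` helper and all constants are illustrative, not BadReward's pipeline.

```python
import numpy as np

# Find a bounded perturbation delta so that phi(x_base + delta) approaches
# the feature vector of a target input, while x stays close to x_base.

def phi(x, M):
    return np.tanh(M @ x)                  # stand-in for a frozen feature extractor

def collide(x_base, f_target, M, eps=0.1, lr=0.005, steps=800, h=1e-5):
    def loss(d):
        return np.sum((phi(x_base + d, M) - f_target) ** 2)
    delta = np.zeros_like(x_base)
    for _ in range(steps):
        g = np.zeros_like(delta)
        for i in range(delta.size):        # finite-difference gradient
            e = np.zeros_like(delta); e[i] = h
            g[i] = (loss(delta + e) - loss(delta - e)) / (2 * h)
        delta = np.clip(delta - lr * g, -eps, eps)   # imperceptibility bound
    return delta

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 6))
x_base = rng.normal(size=6)
x_target = rng.normal(size=6)
f_target = phi(x_target, M)
delta = collide(x_base, f_target, M)
loss0 = np.sum((phi(x_base, M) - f_target) ** 2)
loss1 = np.sum((phi(x_base + delta, M) - f_target) ** 2)
```

The feature-space gap to the target shrinks while the input-space perturbation stays within the bound — the collision that lets a visually benign, correctly labeled pair carry adversarial semantics.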
3. Application Domains and Empirical Results
Active Learning
ALA shows that a small poisoning budget can achieve attack success rates (ASR) of up to 94% under least-confidence or entropy-based acquisition, with ID/OOD accuracy preserved close to the baseline. Imperceptible SIG triggers yield high selection rates, while CL-patch triggers are far less effective (Zhi et al., 5 Aug 2025).
RLHF for Text-to-Image Generation
BadReward reports ASR rising from the clean baseline to $0.84$–$1.0$ on training prompts and $0.59$–$0.90$ on GPT-regenerated prompts, using only a small fraction of poisoned preference pairs. Reward overlap (RO) remains high, and stealth metrics (SSIM, PSNR, LPIPS) confirm human indistinguishability (Duan et al., 3 Jun 2025).
Transfer Learning, Supervised Classification, and Decentralized Settings
Bullseye Polytope (Aghakhani et al., 2020) achieves substantial absolute improvement in transfer-learning ASR versus convex-polytope baselines, scaling up success in both end-to-end and multi-view settings while accelerating time-to-poisoning by roughly an order of magnitude. Wicked Oddities' sample-selection methods consistently outperform random poisoning by $20$–$40$ ASR points while maintaining benign accuracy (Nguyen et al., 2024).
4. Attack Algorithms and Practical Implementations
The practical attack pipeline across domains typically consists of:
- Selection of candidate seeds (target-class samples, OOD pool, or preference pairs).
- Optimization of perturbations to maximize the relevant selection/ranking objective, subject to imperceptibility—via PGD, genetic algorithms, bilevel meta-gradients, or feature-collision updates.
- Injection into training or preference data, with triggers embedded and labels unaltered.
- Exploitation of the victim workflow: for AL, poisons are selected through the standard acquisition process; for RLHF, poisons pass human evaluation undetected; for black-box scenarios, attacks rely on transferability of learned feature representations.
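The optimization step in this pipeline reduces, in its simplest form, to projected gradient ascent over an $\ell_\infty$ ball. The sketch below (the `pgd_maximize` helper and the toy objective are illustrative assumptions, not any paper's exact algorithm) shows the generic shape shared across the domain-specific variants:

```python
import numpy as np

# Generic projected gradient ascent: maximize any selection/ranking
# objective over an L_inf ball of radius eps around the seed.

def pgd_maximize(objective, x_seed, eps, lr=0.05, steps=200, h=1e-5):
    x = x_seed.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(x.size):            # finite-difference gradient
            d = np.zeros_like(x); d[i] = h
            g[i] = (objective(x + d) - objective(x - d)) / (2 * h)
        x = x_seed + np.clip(x + lr * g - x_seed, -eps, eps)   # project
    return x

def target_gap(x):
    return -np.sum((x - 1.0) ** 2)         # toy stand-in for a selection score

x0 = np.array([0.2, -0.4])
x_adv = pgd_maximize(target_gap, x0, eps=0.3)
```

Swapping `target_gap` for an acquisition score, a ranking margin, or a feature-collision loss recovers the AL, preference, and RLHF instantiations described above.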
Both MetaPoison and Bullseye Polytope recommend ensembling surrogate models, truncated optimization steps, and carefully tuned hyperparameters for maximum transferability and stealth (Aghakhani et al., 2020, Huang et al., 2020).
5. Defensive Mechanisms and Mitigation Strategies
Defensive concepts include:
- Acquisition perturbation hardening: Entropy regularization, confidence penalties, and use of diversity-based acquisition sampling reduce single-metric vulnerability (Zhi et al., 5 Aug 2025).
- Anomaly and feature-space monitoring: Outlier removal in embedding space (e.g., deep $k$-NN filters) impairs both feature-collision and bullseye-polytope attacks, but defenders must balance precision and collateral data loss (Aghakhani et al., 2020).
- Reward monitoring and consensus: Multi-model reward validation, CLIP-feature sanitization, and batch-wise anomaly detection flag latent preference anomalies in RLHF pipelines (Duan et al., 3 Jun 2025).
- Holdout validation: Monitoring selection patterns and accuracy gaps on curated validation sets can audit and diagnose preference drift due to poisoning.
- Randomization/composition: Ensembles or randomly switched acquisition/reward models increase robustness by frustrating attacker optimization against fixed selection rules.
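As one assumed instantiation of the anomaly- and feature-space-monitoring idea above, a nearest-neighbor outlier filter can be sketched as follows; the `knn_outlier_scores` helper, the synthetic features, and the 95th-percentile threshold are all illustrative choices.

```python
import numpy as np

# Score each training point by its mean distance to its k nearest neighbors
# in feature space; isolated points (potential feature-collided poisons)
# receive high scores and are rejected.

def knn_outlier_scores(feats, k=3):
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)            # high score = isolated in feature space

rng = np.random.default_rng(3)
feats = rng.normal(size=(30, 5))
feats[4] += 10.0                           # a poison far from the class cluster
scores = knn_outlier_scores(feats)
keep = scores < np.percentile(scores, 95)  # reject the most isolated points
```

The threshold choice is exactly the precision-versus-collateral-loss trade-off noted above: a tighter cutoff catches more poisons but discards more benign samples.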
Defensive trade-offs often include decreased label efficiency, model utility, or increased false-positive sample rejections.
6. Broader Implications and Distinct Vulnerabilities
Preference and selection-based poisoning introduces an indirect mode of attack that can bypass traditional data-cleaning, outlier, or label-flip defenses. The attack combines stealth and efficacy by weaponizing core procedures in label-efficient, active, or preference-driven learning. The dynamic feedback loop in active learning and RLHF compounds the effect: early poisons amplify model vulnerability in successive rounds through retraining and data-selection bias (Zhi et al., 5 Aug 2025, Duan et al., 3 Jun 2025). Unlike classic backdoor attacks, preference poisoning can hijack complex, multi-modal reward behaviors (e.g., modality-crossing in T2I), induce controlled selection in decentralized data collection (Nguyen et al., 2024), and persist under transfer learning or black-box constraints.
The field has highlighted a pressing need for new theoretical and practical defense paradigms—certified poisoning robustness, fine-grained preference sanitization, and learning-theoretic guarantees for acquisition and reward safety. Existing anomaly monitoring, adversarial regularization, and defensive ensemble techniques provide only partial mitigation, and strong, general-purpose defenses remain an open research area.
Key references:
- "Selection-Based Vulnerabilities: Clean-Label Backdoor Attacks in Active Learning" (Zhi et al., 5 Aug 2025)
- "BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF" (Duan et al., 3 Jun 2025)
- "Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks" (Nguyen et al., 2024)
- "Bullseye Polytope: A Scalable Clean-Label Poisoning Attack with Improved Transferability" (Aghakhani et al., 2020)
- "MetaPoison: Practical General-purpose Clean-label Data Poisoning" (Huang et al., 2020)