Clean-Label Preference Poisoning
- Clean-label preference poisoning is a data attack that imperceptibly perturbs a small subset of correctly labeled samples to control model outputs.
- The method employs constrained optimization techniques, such as bilevel optimization and feature collision, ensuring stealth while altering model selections.
- Empirical results across active learning, recommendation systems, and RLHF show high attack success rates under minimal poisoning budgets, emphasizing the need for robust defenses.
Clean-label preference poisoning is a family of data poisoning attacks in which an adversary imperceptibly perturbs a small subset of training or preference data while preserving the original ground-truth labels. The objective is to hijack model preference or selection behaviors—such as model predictions, reward rankings, or acquisition scores—so that, under a secret trigger or under adversarially chosen conditions, the model exhibits attacker-controlled outputs. Clean-label preference poisoning is distinguished from traditional poisoning by its stealth: poisoned data remain correctly labeled and visually or semantically indistinguishable, thereby evading data audits and label-flip detection. These attacks have demonstrated efficacy across active learning, deep recommendation, transfer learning, and reinforcement learning from human feedback (RLHF).
1. Threat Models and Core Definitions
Clean-label preference poisoning attacks operate under stringent constraints:
- No label manipulation: All poisoned samples are labeled correctly by human annotators or preference oracles.
- Imperceptibility: Perturbations to poisoned data are typically bounded, visually undetectable, or induce minimal semantic shift.
- Minimal budget: The attacker injects a limited fraction of poisons (often sub-1% to a few percent).
- Attacker’s knowledge: Varies from white-box (access to model weights and acquisition functions at runtime—e.g., in active learning attacks (Zhi et al., 5 Aug 2025)) to black-box (reward model and RLHF pipeline unknown—e.g., T2I BadReward (Duan et al., 3 Jun 2025)).
Attack objectives include:
- Backdoor induction: At inference, inputs embedded with a specific trigger (e.g., a patch or prompt token) are systematically mapped to a target class, preference, or output, while clean behavior remains unaltered.
- Preference hijack: For pairwise or listwise ranking and reward-based systems, subtly crafted poisons induce preference flips or biased generation when a secret trigger is activated.
2. Methodologies and Optimization Frameworks
The central mechanism for clean-label preference poisoning is optimization over a constrained space of imperceptible perturbations, in order to manipulate model internals or selection outputs via preference-based or ranking losses.
2.1 Bilevel Optimization and MetaPoison
MetaPoison (Huang et al., 2020) generalizes clean-label poisoning to recommendation and preference systems via bilevel optimization:

$$\min_{\delta \in \mathcal{C}} \; \mathcal{L}_{\text{adv}}\big(\theta^*(\delta)\big) \quad \text{s.t.} \quad \theta^*(\delta) = \arg\min_{\theta} \; \mathcal{L}_{\text{train}}\big(\mathcal{D}_{\text{clean}} \cup \mathcal{D}_{\text{poison}}(\delta);\, \theta\big),$$

where $\mathcal{C}$ is the set of imperceptible perturbations (e.g., an $\ell_\infty$ ball). A pairwise or listwise loss may be used (e.g., BPR, hinge), and the adversarial objective manipulates the ranking on a target pair (e.g., forcing the model to prefer item $i$ over item $j$ for user $u$). Optimization employs truncated meta-gradients and ensembles of surrogates to ensure transferability and minimize overfitting to a particular SGD trajectory (Huang et al., 2020).
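The bilevel structure can be illustrated with a deliberately tiny sketch (not MetaPoison's implementation): a one-parameter linear victim, a single-step truncated inner loop, and meta-gradients approximated by finite differences. All names (`inner_step`, `adv_loss`) and constants here are illustrative assumptions.

```python
import numpy as np

# Toy bilevel clean-label poisoning sketch: perturb the FEATURES of one
# correctly labeled point so that, after the victim's truncated retraining
# step, a target input's prediction moves toward an attacker-chosen value.

def inner_step(w, xs, ys, lr=0.1):
    """One SGD step of the victim's 1-D linear model y = w*x on MSE loss."""
    grad = np.mean(2.0 * (w * xs - ys) * xs)
    return w - lr * grad

def adv_loss(delta, w0, xs, ys, x_t, y_t):
    """Attacker loss after the victim retrains on the perturbed data."""
    xs_p = xs.copy()
    xs_p[0] = xs[0] + delta          # perturb features only; label ys[0] untouched
    w1 = inner_step(w0, xs_p, ys)    # truncated inner loop (a single step)
    return (w1 * x_t - y_t) ** 2     # push the target prediction toward y_t

rng = np.random.default_rng(0)
xs = rng.normal(size=8)
ys = 2.0 * xs                        # clean labels from the true model y = 2x
w0, x_t, y_t = 0.5, 1.0, -1.0        # attacker wants f(1.0) pulled toward -1
delta, eps, h = 0.0, 0.3, 1e-4       # eps: imperceptibility budget on the feature
loss0 = adv_loss(0.0, w0, xs, ys, x_t, y_t)
for _ in range(200):                 # outer loop: finite-difference meta-gradient
    g = (adv_loss(delta + h, w0, xs, ys, x_t, y_t)
         - adv_loss(delta - h, w0, xs, ys, x_t, y_t)) / (2 * h)
    delta = float(np.clip(delta - 0.05 * g, -eps, eps))
loss1 = adv_loss(delta, w0, xs, ys, x_t, y_t)
```

The adversarial loss decreases while the perturbation stays inside the budget and the label remains correct — the essence of the clean-label constraint.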
2.2 Selection-Aware Poisoning in Active Learning
ALA (Zhi et al., 5 Aug 2025) introduces a selection-specific variant:

$$\max_{\delta:\, \|\delta\|_\infty \le \epsilon} \; A\big(x_{\text{seed}} + \delta;\, \theta\big),$$

where $A(\cdot;\theta)$ is the acquisition score (e.g., entropy, least confidence) and $x_{\text{seed}}$ is a clean data seed. The attacker optimizes perturbations so that poisoned points are preferentially selected by the AL agent from the unlabeled pool—a clean-label selection attack, guaranteeing high likelihood of model retraining on the backdoor trigger without label violations.
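A minimal sketch of the idea, assuming an entropy acquisition function and a toy linear softmax model (the model, the seed, and the `entropy_score` helper are all illustrative, not ALA's actual setup): gradient ascent on predictive entropy within an $\ell_\infty$ ball around the seed.

```python
import numpy as np

# Push a clean seed toward maximal predictive entropy so that an
# entropy-based acquisition function ranks it near the top of the pool.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_score(x, W):
    """Acquisition score: predictive entropy of a linear softmax model."""
    p = softmax(W @ x)
    return -np.sum(p * np.log(p + 1e-12))

W = np.array([[2.0, 0.0], [0.0, -2.0], [-2.0, 2.0]])  # toy 3-class model
x_seed = np.array([1.0, -1.0])
x, eps, lr, h = x_seed.copy(), 0.5, 0.1, 1e-5
for _ in range(300):
    g = np.zeros_like(x)
    for i in range(x.size):          # finite-difference gradient of the score
        d = np.zeros_like(x); d[i] = h
        g[i] = (entropy_score(x + d, W) - entropy_score(x - d, W)) / (2 * h)
    # ascend, then project back into the L_inf ball around the seed
    x = x_seed + np.clip(x + lr * g - x_seed, -eps, eps)
```

The perturbed point scores strictly higher than the clean seed under the acquisition function while staying within the imperceptibility ball, so the AL loop is more likely to select (and label) it.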
2.3 Selective Poisoning by Sample Hardness
Wicked Oddities (Nguyen et al., 2024) demonstrates that poisoning “hard” or outlier samples in a target class, as ranked by pretrained-feature distance or surrogate-model loss on OOD data, dramatically increases attack success rate, even with no knowledge of other class data or the victim architecture. Algorithms select the top-$k$ hardest samples by cosine distance or classifier loss, then inject triggers solely in this subset.
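The cosine-distance variant of this selection step can be sketched as follows; the `hardest_indices` helper and the synthetic feature cluster are illustrative assumptions, standing in for a pretrained feature extractor applied to target-class samples.

```python
import numpy as np

# Rank target-class samples by cosine distance to the class centroid in a
# (stand-in) feature space; the most distant samples are the "hardest"
# candidates for trigger injection.

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hardest_indices(features, k):
    centroid = features.mean(axis=0)
    dists = np.array([cosine_dist(f, centroid) for f in features])
    return np.argsort(dists)[::-1][:k]     # largest distance = hardest

rng = np.random.default_rng(1)
feats = rng.normal(loc=1.0, size=(20, 8))  # a tight in-class feature cluster
feats[3] = -feats[3]                       # plant two obvious outliers
feats[7] = -feats[7]
print(sorted(hardest_indices(feats, 2)))   # → [3, 7]
```

Only this top-$k$ subset receives the trigger; labels stay correct, so the poisons survive label audits while contributing disproportionately to the backdoor.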
2.4 Feature Collision and Reward Model Attacks in RLHF
BadReward (Duan et al., 3 Jun 2025) constructs feature-collided poisons for text-to-image reward models, solving

$$\min_{\delta:\, \|\delta\| \le \epsilon} \; \big\| \phi(x_{\text{base}} + \delta) - \phi(x_{\text{target}}) \big\|_2^2$$

to create poisons $x_p = x_{\text{base}} + \delta$ that remain visually like $x_{\text{base}}$ but mimic target semantics (e.g., a harmful concept) in CLIP space. The victim model, trained with these clean-labeled pairs, learns to assign high reward to the target concept under the trigger, even without observable label inconsistencies.
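A stripped-down sketch of feature collision, assuming a frozen differentiable feature extractor (here a toy `tanh` map standing in for CLIP) and finite-difference gradients; the `collide` helper and all constants are illustrative, not BadReward's pipeline.

```python
import numpy as np

# Find a bounded perturbation delta so that phi(x_base + delta) approaches
# the feature vector of a target input, while x stays close to x_base.

def phi(x, M):
    return np.tanh(M @ x)                  # stand-in for a frozen feature extractor

def collide(x_base, f_target, M, eps=0.1, lr=0.005, steps=800, h=1e-5):
    def loss(d):
        return np.sum((phi(x_base + d, M) - f_target) ** 2)
    delta = np.zeros_like(x_base)
    for _ in range(steps):
        g = np.zeros_like(delta)
        for i in range(delta.size):        # finite-difference gradient
            e = np.zeros_like(delta); e[i] = h
            g[i] = (loss(delta + e) - loss(delta - e)) / (2 * h)
        delta = np.clip(delta - lr * g, -eps, eps)   # imperceptibility bound
    return delta

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 6))
x_base = rng.normal(size=6)
x_target = rng.normal(size=6)
f_target = phi(x_target, M)
delta = collide(x_base, f_target, M)
loss0 = np.sum((phi(x_base, M) - f_target) ** 2)
loss1 = np.sum((phi(x_base + delta, M) - f_target) ** 2)
```

The feature-space gap to the target shrinks while the input-space perturbation stays within the bound — the collision that lets a visually benign, correctly labeled pair carry adversarial semantics.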
3. Application Domains and Empirical Results
Active Learning
ALA shows that a small poisoning budget can achieve attack success rates (ASR) of up to 94% under least-confidence or entropy-based acquisition, with ID/OOD accuracy preserved close to the baseline. Imperceptible SIG triggers yield high selection rates, while CL-patch triggers are far less effective (Zhi et al., 5 Aug 2025).
RLHF for Text-to-Image Generation
BadReward reports ASR rising from the clean baseline to $0.84$–$1.0$ on training prompts and $0.59$–$0.90$ on GPT-regenerated prompts, using only a small fraction of poisoned preference pairs. Reward overlap (RO) remains high, and stealth metrics (SSIM, PSNR, LPIPS) confirm human indistinguishability (Duan et al., 3 Jun 2025).
Transfer Learning, Supervised Classification, and Decentralized Settings
Bullseye Polytope (Aghakhani et al., 2020) achieves substantial absolute improvement in transfer-learning ASR versus convex-polytope baselines, scaling up success in both end-to-end and multi-view settings while accelerating time-to-poisoning by roughly an order of magnitude. Wicked Oddities' sample-selection methods consistently outperform random poisoning by $20$–$40$ ASR points while maintaining benign accuracy (Nguyen et al., 2024).
4. Attack Algorithms and Practical Implementations
The practical attack pipeline across domains typically consists of:
- Selection of candidate seeds (target-class samples, OOD pool, or preference pairs).
- Optimization of perturbations to maximize the relevant selection/ranking objective, subject to imperceptibility—via PGD, genetic algorithms, bilevel meta-gradients, or feature-collision updates.
- Injection into training or preference data, with triggers embedded and labels unaltered.
- Exploitation of the victim workflow: for AL, poisons are selected through the standard acquisition process; for RLHF, poisons pass human evaluation undetected; for black-box scenarios, attacks rely on transferability of learned feature representations.
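The optimization step in this pipeline reduces, in its simplest form, to projected gradient ascent over an $\ell_\infty$ ball. The sketch below (the `pgd_maximize` helper and the toy objective are illustrative assumptions, not any paper's exact algorithm) shows the generic shape shared across the domain-specific variants:

```python
import numpy as np

# Generic projected gradient ascent: maximize any selection/ranking
# objective over an L_inf ball of radius eps around the seed.

def pgd_maximize(objective, x_seed, eps, lr=0.05, steps=200, h=1e-5):
    x = x_seed.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(x.size):            # finite-difference gradient
            d = np.zeros_like(x); d[i] = h
            g[i] = (objective(x + d) - objective(x - d)) / (2 * h)
        x = x_seed + np.clip(x + lr * g - x_seed, -eps, eps)   # project
    return x

def target_gap(x):
    return -np.sum((x - 1.0) ** 2)         # toy stand-in for a selection score

x0 = np.array([0.2, -0.4])
x_adv = pgd_maximize(target_gap, x0, eps=0.3)
```

Swapping `target_gap` for an acquisition score, a ranking margin, or a feature-collision loss recovers the AL, preference, and RLHF instantiations described above.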
Both MetaPoison and Bullseye Polytope recommend ensembling surrogate models, truncated optimization steps, and carefully tuned hyperparameters for maximum transferability and stealth (Aghakhani et al., 2020, Huang et al., 2020).
5. Defensive Mechanisms and Mitigation Strategies
Defensive concepts include:
- Acquisition perturbation hardening: Entropy regularization, confidence penalties, and use of diversity-based acquisition sampling reduce single-metric vulnerability (Zhi et al., 5 Aug 2025).
- Anomaly and feature-space monitoring: Outlier removal in embedding space (e.g., deep $k$-NN filters) impairs both feature-collision and bullseye-polytope attacks, but defenders must balance precision and collateral data loss (Aghakhani et al., 2020).
- Reward monitoring and consensus: Multi-model reward validation, CLIP-feature sanitization, and batch-wise anomaly detection flag latent preference anomalies in RLHF pipelines (Duan et al., 3 Jun 2025).
- Holdout validation: Monitoring selection patterns and accuracy gaps on curated validation sets can audit and diagnose preference drift due to poisoning.
- Randomization/composition: Ensembles or randomly switched acquisition/reward models increase robustness by frustrating attacker optimization against fixed selection rules.
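As one assumed instantiation of the anomaly- and feature-space-monitoring idea above, a nearest-neighbor outlier filter can be sketched as follows; the `knn_outlier_scores` helper, the synthetic features, and the 95th-percentile threshold are all illustrative choices.

```python
import numpy as np

# Score each training point by its mean distance to its k nearest neighbors
# in feature space; isolated points (potential feature-collided poisons)
# receive high scores and are rejected.

def knn_outlier_scores(feats, k=3):
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)            # high score = isolated in feature space

rng = np.random.default_rng(3)
feats = rng.normal(size=(30, 5))
feats[4] += 10.0                           # a poison far from the class cluster
scores = knn_outlier_scores(feats)
keep = scores < np.percentile(scores, 95)  # reject the most isolated points
```

The threshold choice is exactly the precision-versus-collateral-loss trade-off noted above: a tighter cutoff catches more poisons but discards more benign samples.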
Defensive trade-offs often include decreased label efficiency, model utility, or increased false-positive sample rejections.
6. Broader Implications and Distinct Vulnerabilities
Preference and selection-based poisoning introduces an indirect mode of attack that can bypass traditional data-cleaning, outlier, or label-flip defenses. The attack combines stealth and efficacy by weaponizing core procedures in label-efficient, active, or preference-driven learning. The dynamic feedback loop in active learning and RLHF compounds the effect: early poisons amplify model vulnerability in successive rounds through retraining and data-selection bias (Zhi et al., 5 Aug 2025, Duan et al., 3 Jun 2025). Unlike classic backdoor attacks, preference poisoning can hijack complex, multi-modal reward behaviors (e.g., modality-crossing in T2I), induce controlled selection in decentralized data collection (Nguyen et al., 2024), and persist under transfer learning or black-box constraints.
The field has highlighted a pressing need for new theoretical and practical defense paradigms—certified poisoning robustness, fine-grained preference sanitization, and learning-theoretic guarantees for acquisition and reward safety. Existing anomaly monitoring, adversarial regularization, and defensive ensemble techniques provide only partial mitigation, and strong, general-purpose defenses remain an open research area.
Key references:
- "Selection-Based Vulnerabilities: Clean-Label Backdoor Attacks in Active Learning" (Zhi et al., 5 Aug 2025)
- "BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF" (Duan et al., 3 Jun 2025)
- "Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks" (Nguyen et al., 2024)
- "Bullseye Polytope: A Scalable Clean-Label Poisoning Attack with Improved Transferability" (Aghakhani et al., 2020)
- "MetaPoison: Practical General-purpose Clean-label Data Poisoning" (Huang et al., 2020)