
Reward Poisoning in Reinforcement Learning

Updated 14 January 2026
  • Reward poisoning is an adversarial strategy that alters reward signals to force a target policy through minimal, strategic perturbations.
  • Methodologies include linear programming, bilevel optimization, and adaptive algorithms across RL, contextual bandits, and RLHF settings.
  • Robust defenses such as verification-based strategies, statistical filtering, and adversarial training are critical to mitigate these attacks.

Reward poisoning is an adversarial manipulation of the reward signals or reward-related data that an agent or set of agents uses to learn or optimize policies. The adversary strategically modifies reward information, whether streaming environment feedback, offline datasets, preference labels, or human annotation pipelines, to steer the learner toward an attacker-chosen behavior. Reward poisoning has been demonstrated as a powerful attack vector across single- and multi-agent reinforcement learning (RL), multi-modal and language-model RL from human feedback (RLHF), contextual bandits, and preference-learning systems. Attackers may be black-box or white-box, operate under budget constraints, and can sometimes accomplish their goals using minimal or “stealthy” perturbations with little observable impact on nominal behavior.

1. Formal Models and Attack Objectives

Reward poisoning can be instantiated under various learning and operational regimes. In tabular RL and MDPs, the attacker may perturb the true reward function $r(s,a)$ to $\tilde r(s,a) = r(s,a) + \delta(s,a)$ at each (state, action) pair during training (Zhang et al., 2020, Rakhsha et al., 2020, Rakhsha et al., 2021). In offline RL, the adversary manipulates observed reward entries in the training dataset, replacing $r_i$ by $r_i + \Delta_i$ within $\ell_\infty$ and/or $\ell_1$ budgets (Xu et al., 2024, Wu et al., 2022). In multi-agent settings, the attack generalizes to joint-action and Markov-game frameworks, with the goal of installing a malicious joint equilibrium (Wu et al., 2022).

For reward-model and RLHF training pipelines, especially in LLM or text-to-image domains, attacks may target the output of the preference aggregation process, either by flipping labels in pairwise comparisons (Wang et al., 2023, Wu et al., 2024) or by introducing natural-looking, bias-inducing examples that corrupt reward-model fitting (Duan et al., 3 Jun 2025). In all cases, the core attack goal is to induce the learner to choose a target (joint) policy $\pi^\dagger$ (or class of policies), either by making it uniquely optimal under the manipulated reward or by making all alternative policies suboptimal by a margin.
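The effect of dirty-label preference flipping can be illustrated on a toy Bradley–Terry reward model. This is an illustrative sketch only: the 1-D linear reward $r(x) = wx$, the 30% flip rate, the `fit_bt` helper, and the gradient-ascent fitting are all assumptions chosen for brevity, not the setup of any cited attack.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_bt(prefs, steps=2000, lr=0.1):
    """Fit a 1-D Bradley-Terry reward r(x) = w*x by gradient ascent on
    the log-likelihood of pairwise preferences (winner, loser)."""
    w = 0.0
    for _ in range(steps):
        g = 0.0
        for xw, xl in prefs:
            p = 1.0 / (1.0 + np.exp(-w * (xw - xl)))  # P(winner beats loser)
            g += (1.0 - p) * (xw - xl)
        w += lr * g / len(prefs)
    return w

# Clean data: the item with the larger feature is always preferred (true w > 0).
xs = rng.normal(size=(200, 2))
clean = [(max(a, b), min(a, b)) for a, b in xs]
w_clean = fit_bt(clean)

# Dirty-label attack: flip 30% of the comparisons.
flip = rng.random(200) < 0.3
dirty = [(b, a) if f else (a, b) for (a, b), f in zip(clean, flip)]
w_dirty = fit_bt(dirty)
```

Even an untargeted flip rate like this measurably shrinks the learned reward gap; targeted flips (e.g. only on pairs containing a trigger feature) are correspondingly cheaper and stealthier.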

Formally, the attacker solves an optimization problem of the type

$$\min_{\delta}\ \|\delta\| \quad \text{s.t.} \quad \pi^\dagger = \arg\max_{\pi} V_{\tilde r}^{\pi}, \quad \text{or} \quad V_{\tilde r}^{\pi^\dagger} \geq V_{\tilde r}^{\pi} + \epsilon \quad \forall \pi \neq \pi^\dagger,$$

with constraints encoding the attack budget (per-step and/or cumulative), stealth requirements, or data coverage (Rakhsha et al., 2020, Wu et al., 2022, Xu et al., 2024).
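In the single-state (bandit) special case, $V_{\tilde r}^{\pi}$ reduces to the poisoned reward of the chosen arm, and the program above becomes a plain linear program. A minimal sketch with SciPy under an $\ell_1$ cost; the `min_l1_poison` helper and the example rewards are illustrative, not from any cited paper:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_poison(r, target, eps=0.1):
    """Minimal L1 reward perturbation making arm `target` optimal by margin
    eps: the single-state (bandit) special case of the attack LP."""
    n = len(r)
    # Variables: delta split into positive/negative parts -> 2n nonnegative vars.
    c = np.ones(2 * n)  # L1 cost = sum of both parts
    A, b = [], []
    for a in range(n):
        if a == target:
            continue
        # r[t] + d[t] >= r[a] + d[a] + eps  <=>  d[a] - d[t] <= r[t] - r[a] - eps
        row = np.zeros(2 * n)
        row[a], row[n + a] = 1.0, -1.0          # delta_a = d+_a - d-_a
        row[target], row[n + target] = -1.0, 1.0
        A.append(row)
        b.append(r[target] - r[a] - eps)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, None))
    delta = res.x[:n] - res.x[n:]
    return delta, res.fun

delta, cost = min_l1_poison(np.array([1.0, 0.5, 0.8]), target=1)
```

In the full MDP case the values $V_{\tilde r}^{\pi}$ are themselves linear in $\delta$ for a fixed policy, which is why tabular formulations (Section 2) also reduce to LPs.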

2. Methodologies: Attack Algorithms and Characterization

A broad spectrum of methodologies has been developed for reward-poisoning attacks:

  • Offline Linear Programming (LP) and Convex Optimization: In tabular settings and offline MARL, the attack problem can be formulated as an LP, minimizing the size of the reward perturbation $\|\delta\|_p$ subject to constraints that install the target policy as a unique equilibrium (e.g. a Markov Perfect Dominant-Strategy Equilibrium, MPDSE) (Wu et al., 2022, Rakhsha et al., 2020). For robust optimality, constraints may be enforced only against single-action deviations.
  • Bilevel and Penalty-Based Formulations: For continual or online settings, particularly under black-box assumptions, the attacker solves a bilevel program in which the lower level ensures that the Bellman fixed-point equation holds for $\tilde r$ and the upper level penalizes the value gap in favor of the target policy, often using sample-based stochastic gradients (Li et al., 2024, Zhang et al., 27 Nov 2025).
  • Adaptive and Non-Adaptive Attacks: Adaptive attacks condition perturbations on the current learner’s internal state (e.g. its $Q$-table) and can force the policy to $\pi^\dagger$ in polynomial time for general MDPs, whereas non-adaptive attacks (which observe only state, action, and reward) require exponentially many interactions (Zhang et al., 2020).
  • Clean-Label and Preference-Poisoning Attacks: In preference-based RLHF and reward-model pipelines, adversaries may either flip a fraction of preference labels (dirty-label), or insert clean-label but semantically contradictory comparisons constructed to collide representations in feature/embedding space (Duan et al., 3 Jun 2025, Wang et al., 2023, Wu et al., 2024).
  • Black-Box Targeted Attacks: These methods require no knowledge of the environment or the learner: they perturb observed rewards online according to a distance metric to the target policy, using action-dependent penalties, under both per-step and total attack budgets (Xu et al., 2023, Xu et al., 2022, Rakhsha et al., 2021, Xu et al., 2024).
  • Sublinear-Cost Attacks: Under order-optimal learning, attackers with only $O(\sqrt{T})$ total contamination can still force the learner to a malicious policy (Rangi et al., 2022). In bandit settings, $O(\log T)$ contamination suffices to destroy optimal regret unless defenses are employed (Rangi et al., 2021).
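The action-dependent-penalty idea behind the black-box attacks above can be sketched with tabular Q-learning on a toy MDP: the attacker subtracts a fixed penalty from the observed reward whenever the agent's action deviates from the target policy. The environment (uniform random transitions), the penalty size, and the learning parameters below are illustrative assumptions, not any cited paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
target = np.ones(n_states, dtype=int)     # attacker's target policy: always act 1
Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(20000):
    # epsilon-greedy behavior of the (unwitting) learner
    a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[s].argmax())
    r = 1.0 if a == 0 else 0.0            # true reward favors action 0
    r_tilde = r - 2.0 * (a != target[s])  # poisoning: penalize deviations from target
    s2 = int(rng.integers(n_states))      # toy dynamics: uniform random next state
    Q[s, a] += 0.1 * (r_tilde + 0.9 * Q[s2].max() - Q[s, a])
    s = s2
greedy = Q.argmax(axis=1)                 # learned greedy policy
```

Because the penalty is conditioned only on the observed (state, action) pair, the attacker needs no model of the dynamics or of the learner's internals, matching the black-box setting described above.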

3. Empirical and Theoretical Impact

Reward poisoning attacks have been empirically demonstrated to degrade the performance of reinforcement learners, neural bandits, and reward-model-driven LLMs or T2I systems across classic control (CartPole, MountainCar, Acrobot), MuJoCo robot domains (Hopper, Walker2d, HalfCheetah), grid-world navigation, and RLHF pipelines (Xu et al., 2022, Zhang et al., 27 Nov 2025, Xu et al., 2024, Wu et al., 2022, Wang et al., 2023, Duan et al., 3 Jun 2025).

Key findings include:

  • Black-box reward poisoning can reduce episode rewards by 60–85% on triggered deployment while remaining undetectable under standard metrics (<5% degradation in clean scenarios) (Zhang et al., 27 Nov 2025).
  • In offline RL, the policy contrast attack achieves catastrophic policy degradation using a small $\ell_1$ budget, often matching full reward-inversion attacks at a fraction of the cost (Xu et al., 2024).
  • RLHF reward-poisoning (e.g., RankPoison, BadReward) can systematically bias LLMs or diffusion models toward desired malicious behaviors or install backdoors with as little as 0.1–5% of preference labels corrupted while preserving alignment performance on benign tasks (Wang et al., 2023, Duan et al., 3 Jun 2025).
  • In bandit settings, regret can be driven to $\Omega(T)$ with only $O(\log T)$ adversarial contamination unless explicit defenses are deployed (Rangi et al., 2021, Xu et al., 2024).

4. Factors Determining Attack Feasibility, Cost, and Stealth

Critical determinants of attack feasibility and stealth include:

  • Reward magnitude constraint: For bounded rewards (e.g., in $[0,1]$), pure reward-only attacks may not suffice in some cases and must be combined with action or transition tampering (Rangi et al., 2022).
  • Data coverage: If some (state,action) pairs have no support in the dataset, offline attacks cannot enforce uniqueness of the target policy on those entries, regardless of perturbation size (Wu et al., 2022).
  • Attack budget: Sublinear (in time horizon) attack budgets often suffice for policy hijack in sequential settings; cost-optimal attacks can be constructed via linear programming or black-box meta-learning (Zhang et al., 2020, Rangi et al., 2022, Rakhsha et al., 2021).
  • Stealth metrics: Stealth is typically measured by deviation in cumulative reward, agent performance under non-triggered conditions, distribution of perturbed rewards, or degree of label-permutation in RLHF. State-of-the-art attacks exhibit minimal impact on these metrics until the attack is triggered (Zhang et al., 27 Nov 2025, Duan et al., 3 Jun 2025).
  • Adaptivity and exploration: Adaptive attacks exploiting the learner’s Q-values dominate in efficiency for environments with large state spaces or sparse coverage (Zhang et al., 2020).

5. Defenses and Mitigation Strategies

Defensive methodologies are at an early stage but include:

  • Verification-based strategies: Interleaving a logarithmic number of trusted reward verifications (e.g., Secure-UCB, Secure-BARBAR) can fully restore sample-efficient regret guarantees in bandits under arbitrary levels of reward contamination (Rangi et al., 2021).
  • Robust reward filtering: Agents can apply statistical smoothing, outlier detection, or confidence intervals to screen for reward anomalies (Zhang et al., 27 Nov 2025, Li et al., 2024, Xu et al., 2024).
  • Algorithmic robustification: Robust Thompson Sampling and robust UCB variants can bound regret under $\ell_1$-budgeted adversaries, using pseudo-posteriors and clipped regression (Xu et al., 2024, Sasnauskas et al., 7 Jun 2025).
  • Adversarial training: In in-context RL (DPT, AT-DPT) and RLHF pipelines, adversarial min-max training of agent and attacker reduces regret under attack and improves robustness beyond classical methods (Sasnauskas et al., 7 Jun 2025).
  • Preference-data auditing: Cross-validation of annotator labels, regularization of reward-model outputs, and multi-modal consistency checks help detect or mitigate malicious biases in RLHF (Wang et al., 2023, Duan et al., 3 Jun 2025).
  • Certified defenses: Certified offline RL defenses randomize or smooth data to guarantee robustness up to a threshold, though these remain theoretical in model-based RL settings (Wu et al., 2022).
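The statistical reward-filtering idea can be sketched as a rolling median/MAD clipping rule. The window size, the threshold `k`, the `filter_rewards` helper, and the clip-rather-than-reject behavior are illustrative design choices, not a published defense.

```python
import numpy as np

def filter_rewards(rewards, window=25, k=3.0):
    """Clip each incoming reward to a robust interval (median +/- k*MAD)
    estimated from the trailing window of already-accepted rewards."""
    accepted = []
    for r in rewards:
        if len(accepted) >= window:
            recent = np.array(accepted[-window:])
            med = np.median(recent)
            mad = np.median(np.abs(recent - med)) + 1e-8  # avoid zero width
            r = float(np.clip(r, med - k * mad, med + k * mad))
        accepted.append(float(r))
    return accepted

# A poisoned spike in an otherwise steady reward stream is flattened.
stream = [1.0] * 30 + [50.0] + [1.0] * 10
cleaned = filter_rewards(stream)
```

Clipping (rather than dropping) keeps the reward stream dense for the learner; more aggressive variants replace flagged values with a model-based estimate or escalate to a trusted check, as in the verification-based strategies above. Note that such filters catch large outliers but not stealthy, bounded perturbations.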

6. Open Questions and Research Directions

Key challenges remain in designing RL and RLHF systems robust to reward poisoning:

  • Tight lower and upper bounds on minimal attack budget under algorithmic or information-theoretic restrictions.
  • Formal guarantees for robust sequential RL under cumulative adversarial contamination.
  • Robust preference and reward-model learning under adversarial manipulation, particularly in high-dimensional, sparse, or non-stationary data regimes.
  • Certified, distribution-free defenses that scale with large state/action spaces, ambiguous annotations, and complex reward models.
  • Hybridization of robust statistical methods, adversarial training, and meta-learning to immunize against both dirty-label and clean-label attacks.
  • Understanding and mitigating reward poisoning in federated, privacy-preserving, and partially observable RL.
  • Deployment-time detection and response strategies for environments where verification or trusted feedback is constrained or unavailable.

Reward poisoning represents a pervasive and challenging threat to RL and RLHF systems, with attack methodologies now well-understood for both white-box and black-box adversaries under a variety of operational settings. Effective defense is a central open problem that spans optimization theory, sequential statistics, robust learning, and system security. For a comprehensive treatment of attack and defense mechanisms, see (Wu et al., 2022, Xu et al., 2022, Xu et al., 2024, Zhang et al., 27 Nov 2025, Duan et al., 3 Jun 2025, Wang et al., 2023).
