α-Reward-Preserving Attacks in RL
- α-Reward-Preserving Attacks are adversarial interventions in reinforcement learning designed to manipulate reward signals while maintaining an α fraction of the policy’s nominal return.
- These attacks leverage techniques like dynamic reward poisoning and gradient-based perturbations to balance stealth with effective control over agent behavior.
- Empirical studies show high success rates and near-optimal return preservation, highlighting the critical need for robust defense mechanisms in RL systems.
The concept of α-reward-preserving attacks in reinforcement learning (RL) defines a class of adversarial interventions designed to induce a specified malicious behaviour—such as the forced execution of an action upon presentation of a trigger—while concurrently maintaining a policy's nominal return above an α fraction of its unpoisoned counterpart. These attacks are distinguished by their combination of stealth (high reward preservation, with α close to 1) and potent control over victim agent behaviour. Research spanning backdoor poisoning, robust RL, and reward perturbation optimization has developed precise theoretical, algorithmic, and empirical frameworks for both attack and defense, as documented in "SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents" (Rathbun et al., 2024), "Reward-Preserving Attacks For Robust Reinforcement Learning" (Schott et al., 12 Jan 2026), and "Defense Against Reward Poisoning Attacks in Reinforcement Learning" (Banihashem et al., 2021).
1. Formal Definitions and Threat Model
At its core, an α-reward-preserving attack pertains to an RL agent trained on a Markov decision process (MDP) M = (S, A, P, R, γ), comprising a state space S, action set A, transition kernel P, reward function R, and discount factor γ. Let π denote the policy trained without intervention, yielding expected discounted return J(π). An adversarially trained policy π′ under a backdoor or perturbation mechanism is said to be α-reward-preserving if J(π′) ≥ α·J(π), i.e., the poisoned agent achieves at least an α fraction of the original return.
In backdoor threat models (Rathbun et al., 2024), an adversary defines a trigger function δ: S → S, mapping benign states to poisoned versions. Poisoning is achieved by injecting triggered states during training under a limited poisoning budget, alongside reward manipulation through dynamically chosen reward functions. In reward-perturbation settings (Schott et al., 12 Jan 2026, Banihashem et al., 2021), the attacker solves a constrained optimization to induce a target policy as uniquely optimal under a distance-minimizing modification of the reward function, enforcing an optimality margin over all neighbouring policies.
This threat model introduces both a measure of attack strength (progress towards the adversary’s goals) and stealth (avoidance of detection through performance metrics).
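The definition above admits a direct empirical check. Below is a minimal Python sketch (the helper names are ours, not from the cited papers) that computes discounted returns and tests the α-preservation condition J(π′) ≥ α·J(π):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a single episode's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def is_alpha_reward_preserving(clean_rewards, poisoned_rewards, alpha, gamma=0.99):
    """True if the poisoned policy retains at least an alpha fraction of
    the clean policy's discounted return: J(pi') >= alpha * J(pi)."""
    j_clean = discounted_return(clean_rewards, gamma)
    j_poisoned = discounted_return(poisoned_rewards, gamma)
    return j_poisoned >= alpha * j_clean
```

In practice J(π) and J(π′) would be estimated by averaging over many evaluation episodes rather than a single reward sequence.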
2. Attack Construction and Optimization Strategies
The adversary’s optimization problem formalizes two competing objectives:
- Attack Success: With high probability, induce a specified adversarial outcome—e.g., force action when a trigger is present.
- Reward Preservation: Maintain episodic or per-state return within an fraction of the nominal baseline, minimizing the risk of detection or performance degradation.
This dual objective is typically cast as a constrained optimization: maximize attack success subject to J(π′) ≥ α·J(π), where π′ is the policy produced by the learning algorithm (e.g., PPO) on the poisoned MDP, and J evaluates π′ on the original MDP.
In "SleeperNets" (Rathbun et al., 2024), the reward-poisoning function is chosen dynamically such that:
- The Bellman backup at benign states is preserved exactly.
- At poisoned states, the reward immediately and exclusively favours the adversary's target action, indifferent to downstream value.
- The optimal policy in the modified MDP coincides, on benign states, with an optimal policy for the benign task.
In robust RL settings (Schott et al., 12 Jan 2026), attacks are tuned per state and action by decomposing perturbations into a gradient-derived direction d and a magnitude ε, maximizing adversarial impact subject to the per-state constraint Q_adv(s, a) ≥ α·Q_nom(s, a) + (1 − α)·Q_worst(s, a), where Q_worst denotes the return under the worst-case adversary and Q_nom the nominal Q-value.
3. Algorithmic Realizations
SleeperNets Dynamic Reward Poisoning (Rathbun et al., 2024)
Attacks are implemented via an episode-level procedure:
- Sample a trajectory τ under the current policy π.
- Randomly select a small fraction of the transitions in τ for poisoning.
- For each poisoned transition (s, a, r, s′):
- Apply the trigger to s, producing the triggered state.
- Poison the reward so that it immediately favours the target action at the triggered state.
- Retroactively adjust the rewards of surrounding benign transitions so that the Bellman backup at benign states is preserved in expectation.
- Update replay buffer and optimize policy as usual.
Notably, retroactive reward correction ensures that benign transitions remain unaffected in expected value, maintaining stealth.
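The episode-level procedure above can be sketched as follows. The trigger function, target action, budget, and reward bonus are illustrative placeholders, and the retroactive Bellman-preserving correction of neighbouring rewards is omitted for brevity:

```python
import random

def poison_episode(transitions, trigger, target_action, budget=0.05,
                   bonus=1.0, seed=None):
    """Episode-level poisoning sketch: pick a random fraction of
    transitions, apply the trigger to their states, and reshape the
    reward to favour the target action at triggered states.
    `transitions` is a list of (state, action, reward, next_state)."""
    rng = random.Random(seed)
    n = len(transitions)
    k = max(1, int(budget * n))          # number of transitions to poison
    poisoned_idx = set(rng.sample(range(n), k))
    out = []
    for i, (s, a, r, s_next) in enumerate(transitions):
        if i in poisoned_idx:
            s = trigger(s)               # embed the trigger pattern
            r = bonus if a == target_action else -bonus
        out.append((s, a, r, s_next))
    return out
```

The poisoned transitions would then be written to the replay buffer alongside the untouched ones, and training proceeds as usual.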
Gradient-Based Adaptive Robustness Attacks (Schott et al., 12 Jan 2026)
Deep RL attacks select a unit-norm perturbation direction from the negative gradient of the critic, and search over candidate perturbation magnitudes ε:
- Compute the gradient g = ∇_s Q(s, a) of the critic with respect to the state.
- Form the unit direction d = −g / ‖g‖₂.
- Search for the maximal ε such that the perturbed state s + ε·d still satisfies the per-state reward-preservation constraint.
- Apply the perturbation s ← s + ε·d.
Joint critic networks, one dynamic and one static, are trained off-policy to evaluate robustness across perturbation radii and actions.
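A toy numpy rendering of this direction-and-magnitude search, using finite-difference gradients in place of autodiff and a caller-supplied worst-case value (all names are ours, not the paper's):

```python
import numpy as np

def attack_direction(q_fn, state, action, eps_fd=1e-5):
    """Unit-norm direction of steepest Q decrease, via central
    finite differences on the critic q_fn(state, action)."""
    grad = np.zeros_like(state, dtype=float)
    for i in range(state.size):
        bump = np.zeros_like(state, dtype=float)
        bump[i] = eps_fd
        grad[i] = (q_fn(state + bump, action) - q_fn(state - bump, action)) / (2 * eps_fd)
    d = -grad                            # descend the critic
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d

def max_preserving_magnitude(q_fn, state, action, direction, alpha,
                             q_worst, candidates):
    """Largest candidate magnitude whose perturbed Q-value still meets
    Q(s + eps*d, a) >= alpha * Q(s, a) + (1 - alpha) * q_worst."""
    q_nom = q_fn(state, action)
    floor = alpha * q_nom + (1.0 - alpha) * q_worst
    best = 0.0
    for eps in sorted(candidates):
        if q_fn(state + eps * direction, action) >= floor:
            best = eps
    return best
```

With a trained critic, the gradient step would come from backpropagation rather than finite differences; the constraint check is the same.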
4. Theoretical Guarantees and Analytical Characterization
Theoretical analysis in both backdoor and robust settings delivers the following guarantees:
- Backdoor Success and Stealth (SleeperNets): Theorems establish, under capacity and data sufficiency, that the optimal poisoned policy will (i) choose the target action with probability one when the trigger is present, and (ii) match the clean policy's return on benign states, achieving α = 1 exactly (Rathbun et al., 2024).
- Per-State Robustness Constraint (Reward-Preserving): For any state-action pair, the return under attack is bounded below by α·Q_nom(s, a) + (1 − α)·Q_worst(s, a), preserving an α fraction of the nominal-to-worst-case gap (Schott et al., 12 Jan 2026).
Additional properties include reward structure preservation (the Q-value ordering remains unchanged at the extreme settings of α) and preference shift quantification, determining when action selection orderings reverse as a function of adversary strength and α.
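Preference-shift quantification can be illustrated under a simple linear-interpolation model of adversary strength (an assumption made here for illustration, not the papers' exact formulation):

```python
def preference_shift_strength(q_nom, q_worst, steps=1000):
    """Smallest adversary strength t in [0, 1] at which the preferred
    action under the interpolated Q-values
        Q_t(a) = (1 - t) * q_nom[a] + t * q_worst[a]
    differs from the nominal argmax; returns None if no shift occurs."""
    best_nom = max(range(len(q_nom)), key=lambda a: q_nom[a])
    for i in range(steps + 1):
        t = i / steps
        q_t = [(1 - t) * q_nom[a] + t * q_worst[a] for a in range(len(q_nom))]
        if max(range(len(q_t)), key=lambda a: q_t[a]) != best_nom:
            return t
    return None
```

When the nominal and worst-case Q-vectors agree on the best action, no strength reverses the ordering and the function returns None; otherwise the returned threshold quantifies how strong the adversary must be before the victim's action choice flips.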
5. Defense Strategies and Performance Guarantees
Defensive frameworks against α-reward-preserving attacks, as established in (Banihashem et al., 2021), utilize robust optimization over occupancy measures:
- LP Formulation: Agents optimize over all policies to maximize worst-case return under the poisoned reward vector, subject to a known or bounded attack magnitude.
- Certificates: For any defense policy so derived, its worst-case return on the true reward will not fall below its performance on the poisoned reward, and its suboptimality gap relative to the attacker's target policy is tightly upper-bounded.
- Extensions: When the attack magnitude is unknown, substituting an upper bound preserves the guarantees provided it over-estimates the true magnitude; under-estimation leads to collapse to the attacker's target policy.
These optimization programs are tractable and yield nontrivial guarantees for robustness against minimally stealthy reward-poisoning.
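As a simplified stand-in for the occupancy-measure LP, the sketch below performs pessimistic value iteration under per-entry reward uncertainty: the defender trusts each observed reward only up to a bound eps and plans against the worst case. This interval model and the helper names are our assumptions, not the paper's formulation:

```python
import numpy as np

def robust_value_iteration(P, r_hat, eps, gamma=0.9, iters=500):
    """Pessimistic planning sketch: assume the true reward lies within
    eps of the observed (possibly poisoned) reward r_hat in each (s, a)
    entry, and plan against the worst case r_hat - eps.
    P has shape (S, A, S); r_hat and eps have shape (S, A).
    Returns (greedy policy, state values)."""
    r_worst = r_hat - eps                 # elementwise worst-case reward
    S, A, _ = P.shape
    v = np.zeros(S)
    q = r_worst.copy()
    for _ in range(iters):
        q = r_worst + gamma * (P @ v)     # (S, A) Bellman backup
        v = q.max(axis=1)
    return q.argmax(axis=1), v
```

Entries the defender suspects of poisoning get a large eps, so the plan stops relying on their inflated rewards; entries with eps = 0 are trusted as-is.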
6. Empirical Protocols and Quantitative Results
Empirical studies in (Rathbun et al., 2024, Schott et al., 12 Jan 2026) evaluate attack and defense across diverse environments:
| Method | Attack Success Rate (ASR) | Benign Return Ratio (BRR) |
|---|---|---|
| SleeperNets | 100% | ≥96.5% |
| TrojDRL-W | 57–99% | 26.6–100% |
| BadRL-M | 0–100% | 70–100% |
- SleeperNets achieves perfect attack success and near-perfect return preservation across tasks, often with minimal poisoning budgets.
- Adaptive robustness attacks demonstrate, e.g., on HalfCheetah-v5, that an intermediate α yields robust returns across a range of adversarial radii without collapse of nominal performance; fixed-radius baselines lack such generalization (Schott et al., 12 Jan 2026).
Robustness profiles support the trade-off: an intermediate α delivers substantial distributional robustness with controlled nominal return loss.
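The two headline metrics in the table can be computed from episode logs as follows (the log format and helper names are illustrative):

```python
def attack_success_rate(episodes, target_action):
    """Fraction of triggered time steps on which the victim took the
    adversary's target action. Each episode is a list of
    (triggered: bool, action) pairs."""
    hits = [(a == target_action) for ep in episodes
            for trig, a in ep if trig]
    return sum(hits) / len(hits) if hits else 0.0

def benign_return_ratio(clean_returns, poisoned_returns):
    """Mean poisoned-policy return on benign (trigger-free) episodes
    divided by the mean clean-policy return: the empirical alpha."""
    mean_poisoned = sum(poisoned_returns) / len(poisoned_returns)
    mean_clean = sum(clean_returns) / len(clean_returns)
    return mean_poisoned / mean_clean
```

A BRR of 1.0 means the attack is perfectly stealthy by this metric; an ASR of 1.0 means the trigger always elicits the target action.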
7. Strengths, Limitations, and Research Directions
α-reward-preserving attacks offer:
- Provable stealth: Guaranteed preservation of reward performance for arbitrarily strong backdoors or perturbations.
- Universality: Applicability across MDPs, observation spaces, and RL architectures.
- Efficient resource use: the poisoning budget can be annealed to extremely low values.
Limitations include the potential detectability of large-magnitude reward perturbations via reward-signal monitoring, and the need for threat models granting access to full episode data. Assumptions on trigger function invertibility or support disjointness may restrict practical scope.
Open questions and future extensions include:
- Development of monitoring defenses sensitive to reward distribution offsets.
- Integration of gradient-based trigger optimization for further stealth and efficiency.
- Generalization of -reward-preserving principles to multi-agent and cooperative RL.
- Exploration of partial reward preservation (smaller α) for enhanced stealth under weaker constraints.
- Relaxation of invertibility and disjoint support in trigger generation.
These considerations continue to drive research into both the precision of adversarial attack design and principled defense in reinforcement learning.