α-Reward-Preserving Attacks in RL
- α-Reward-Preserving Attacks are adversarial interventions in reinforcement learning designed to manipulate reward signals while maintaining an α fraction of the policy’s nominal return.
- These attacks leverage techniques like dynamic reward poisoning and gradient-based perturbations to balance stealth with effective control over agent behavior.
- Empirical studies show high success rates and near-optimal return preservation, highlighting the critical need for robust defense mechanisms in RL systems.
The concept of α-reward-preserving attacks in reinforcement learning (RL) defines a class of adversarial interventions designed to induce a specified malicious behaviour—such as the forced execution of an action upon presentation of a trigger—while concurrently maintaining a policy's nominal return above an α fraction of its unpoisoned counterpart. These attacks are distinguished by their combination of stealth (high reward preservation, with α close to 1) and potent control over victim agent behaviour. Research spanning backdoor poisoning, robust RL, and reward perturbation optimization has developed precise theoretical, algorithmic, and empirical frameworks for both attack and defense, as documented in "SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents" (Rathbun et al., 2024), "Reward-Preserving Attacks For Robust Reinforcement Learning" (Schott et al., 12 Jan 2026), and "Defense Against Reward Poisoning Attacks in Reinforcement Learning" (Banihashem et al., 2021).
1. Formal Definitions and Threat Model
At its core, an α-reward-preserving attack pertains to an RL agent trained on a Markov decision process (MDP) M = (S, A, P, R, γ), comprising a state space S, action set A, transition kernel P, reward function R, and discount factor γ. Let π denote the policy trained without intervention, yielding expected discounted return J(π). An adversarially trained policy π′ under a backdoor or perturbation mechanism is said to be α-reward-preserving if J(π′) ≥ α·J(π), i.e., the poisoned agent achieves at least an α fraction of the original return.
In backdoor threat models (Rathbun et al., 2024), an adversary defines a trigger function δ: S → S, mapping benign states to poisoned versions. Poisoning is achieved by injecting triggered states during training under a limited poisoning budget, alongside reward manipulation through dynamically chosen reward functions. In reward-perturbation settings (Schott et al., 12 Jan 2026, Banihashem et al., 2021), the attacker solves a constrained optimization to induce a target policy as uniquely optimal under a distance-minimizing modification of the reward function, enforcing an optimality margin over all neighbouring policies.
This threat model introduces both a measure of attack strength (progress towards the adversary’s goals) and stealth (avoidance of detection through performance metrics).
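The definition above admits a direct empirical check. Below is a minimal Python sketch (the helper names are ours, not from the cited papers) that computes discounted returns and tests the α-preservation condition J(π′) ≥ α·J(π):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a single episode's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def is_alpha_reward_preserving(clean_rewards, poisoned_rewards, alpha, gamma=0.99):
    """True if the poisoned policy retains at least an alpha fraction of
    the clean policy's discounted return: J(pi') >= alpha * J(pi)."""
    j_clean = discounted_return(clean_rewards, gamma)
    j_poisoned = discounted_return(poisoned_rewards, gamma)
    return j_poisoned >= alpha * j_clean
```

In practice J(π) and J(π′) would be estimated by averaging over many evaluation episodes rather than a single reward sequence.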
2. Attack Construction and Optimization Strategies
The adversary’s optimization problem formalizes two competing objectives:
- Attack Success: With high probability, induce a specified adversarial outcome—e.g., force action when a trigger is present.
- Reward Preservation: Maintain episodic or per-state return within an fraction of the nominal baseline, minimizing the risk of detection or performance degradation.
This dual objective is typically cast as a constrained optimization: maximize attack success subject to J(π′) ≥ α·J(π), where π′ is the policy produced by the learning algorithm (e.g., PPO) on the poisoned MDP, and J evaluates π′ on the original MDP.
In "SleeperNets" (Rathbun et al., 2024), the reward-poisoning function is chosen dynamically such that:
- The Bellman backup at benign states is preserved exactly.
- At poisoned states, the reward immediately and exclusively favours the adversary's target action, indifferent to downstream value.
- The optimal policy in the modified MDP coincides, on benign states, with an optimal policy for the benign task.
In robust RL settings (Schott et al., 12 Jan 2026), attacks are tuned per state and action by decomposing perturbations into a gradient-derived direction d and a magnitude ε, maximizing adversarial impact subject to the per-state constraint Q_adv(s, a) ≥ α·Q_nom(s, a) + (1 − α)·Q_worst(s, a), where Q_worst denotes the return under the worst-case adversary and Q_nom the nominal Q-value.
3. Algorithmic Realizations
SleeperNets Dynamic Reward Poisoning (Rathbun et al., 2024)
Attacks are implemented via an episode-level procedure:
- Sample a trajectory τ under the current policy π.
- Randomly select a small fraction of the transitions in τ for poisoning.
- For each poisoned transition (s, a, r, s′):
- Apply the trigger to s, producing the triggered state.
- Poison the reward so that it immediately favours the target action at the triggered state.
- Retroactively adjust the rewards of surrounding benign transitions so that the Bellman backup at benign states is preserved in expectation.
- Update replay buffer and optimize policy as usual.
Notably, retroactive reward correction ensures that benign transitions remain unaffected in expected value, maintaining stealth.
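The episode-level procedure above can be sketched as follows. The trigger function, target action, budget, and reward bonus are illustrative placeholders, and the retroactive Bellman-preserving correction of neighbouring rewards is omitted for brevity:

```python
import random

def poison_episode(transitions, trigger, target_action, budget=0.05,
                   bonus=1.0, seed=None):
    """Episode-level poisoning sketch: pick a random fraction of
    transitions, apply the trigger to their states, and reshape the
    reward to favour the target action at triggered states.
    `transitions` is a list of (state, action, reward, next_state)."""
    rng = random.Random(seed)
    n = len(transitions)
    k = max(1, int(budget * n))          # number of transitions to poison
    poisoned_idx = set(rng.sample(range(n), k))
    out = []
    for i, (s, a, r, s_next) in enumerate(transitions):
        if i in poisoned_idx:
            s = trigger(s)               # embed the trigger pattern
            r = bonus if a == target_action else -bonus
        out.append((s, a, r, s_next))
    return out
```

The poisoned transitions would then be written to the replay buffer alongside the untouched ones, and training proceeds as usual.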
Gradient-Based Adaptive Robustness Attacks (Schott et al., 12 Jan 2026)
Deep RL attacks select a unit-norm perturbation direction from the negative gradient of the critic, and search over candidate perturbation magnitudes ε:
- Compute the gradient g = ∇_s Q(s, a) of the critic with respect to the state.
- Form the unit direction d = −g / ‖g‖₂.
- Search for the maximal ε such that the perturbed state s + ε·d still satisfies the per-state reward-preservation constraint.
- Apply the perturbation s ← s + ε·d.
Joint critic networks, one dynamic and one static, are trained off-policy to evaluate robustness across perturbation radii and actions.
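A toy numpy rendering of this direction-and-magnitude search, using finite-difference gradients in place of autodiff and a caller-supplied worst-case value (all names are ours, not the paper's):

```python
import numpy as np

def attack_direction(q_fn, state, action, eps_fd=1e-5):
    """Unit-norm direction of steepest Q decrease, via central
    finite differences on the critic q_fn(state, action)."""
    grad = np.zeros_like(state, dtype=float)
    for i in range(state.size):
        bump = np.zeros_like(state, dtype=float)
        bump[i] = eps_fd
        grad[i] = (q_fn(state + bump, action) - q_fn(state - bump, action)) / (2 * eps_fd)
    d = -grad                            # descend the critic
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d

def max_preserving_magnitude(q_fn, state, action, direction, alpha,
                             q_worst, candidates):
    """Largest candidate magnitude whose perturbed Q-value still meets
    Q(s + eps*d, a) >= alpha * Q(s, a) + (1 - alpha) * q_worst."""
    q_nom = q_fn(state, action)
    floor = alpha * q_nom + (1.0 - alpha) * q_worst
    best = 0.0
    for eps in sorted(candidates):
        if q_fn(state + eps * direction, action) >= floor:
            best = eps
    return best
```

With a trained critic, the gradient step would come from backpropagation rather than finite differences; the constraint check is the same.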
4. Theoretical Guarantees and Analytical Characterization
Theoretical analysis in both backdoor and robust settings delivers the following guarantees:
- Backdoor Success and Stealth (SleeperNets): Theorems establish, under capacity and data sufficiency, that the optimal poisoned policy will (i) choose the target action with probability one when the trigger is present, and (ii) match the clean policy's return on benign states, achieving α = 1 exactly (Rathbun et al., 2024).
- Per-State Robustness Constraint (Reward-Preserving): For any state-action pair, the return under attack is bounded below by α·Q_nom(s, a) + (1 − α)·Q_worst(s, a), preserving an α fraction of the nominal-to-worst-case gap (Schott et al., 12 Jan 2026).
Additional properties include reward structure preservation (the Q-value ordering remains unchanged at the extreme settings of α) and preference shift quantification, determining when action selection orderings reverse as a function of adversary strength and α.
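Preference-shift quantification can be illustrated under a simple linear-interpolation model of adversary strength (an assumption made here for illustration, not the papers' exact formulation):

```python
def preference_shift_strength(q_nom, q_worst, steps=1000):
    """Smallest adversary strength t in [0, 1] at which the preferred
    action under the interpolated Q-values
        Q_t(a) = (1 - t) * q_nom[a] + t * q_worst[a]
    differs from the nominal argmax; returns None if no shift occurs."""
    best_nom = max(range(len(q_nom)), key=lambda a: q_nom[a])
    for i in range(steps + 1):
        t = i / steps
        q_t = [(1 - t) * q_nom[a] + t * q_worst[a] for a in range(len(q_nom))]
        if max(range(len(q_t)), key=lambda a: q_t[a]) != best_nom:
            return t
    return None
```

When the nominal and worst-case Q-vectors agree on the best action, no strength reverses the ordering and the function returns None; otherwise the returned threshold quantifies how strong the adversary must be before the victim's action choice flips.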
5. Defense Strategies and Performance Guarantees
Defensive frameworks against α-reward-preserving attacks, as established in (Banihashem et al., 2021), utilize robust optimization over occupancy measures:
- LP Formulation: Agents optimize over all policies to maximize worst-case return under the poisoned reward vector, subject to a known or bounded attack magnitude.
- Certificates: For any defense policy so derived, its worst-case return on the true reward will not fall below its performance on the poisoned reward, and its suboptimality gap relative to the attacker's target policy is tightly upper-bounded.
- Extensions: When the attack magnitude is unknown, substituting an upper bound preserves the guarantees provided it over-estimates the true magnitude; under-estimation leads to collapse to the attacker's target policy.
These optimization programs are tractable and yield nontrivial guarantees for robustness against minimally stealthy reward-poisoning.
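As a simplified stand-in for the occupancy-measure LP, the sketch below performs pessimistic value iteration under per-entry reward uncertainty: the defender trusts each observed reward only up to a bound eps and plans against the worst case. This interval model and the helper names are our assumptions, not the paper's formulation:

```python
import numpy as np

def robust_value_iteration(P, r_hat, eps, gamma=0.9, iters=500):
    """Pessimistic planning sketch: assume the true reward lies within
    eps of the observed (possibly poisoned) reward r_hat in each (s, a)
    entry, and plan against the worst case r_hat - eps.
    P has shape (S, A, S); r_hat and eps have shape (S, A).
    Returns (greedy policy, state values)."""
    r_worst = r_hat - eps                 # elementwise worst-case reward
    S, A, _ = P.shape
    v = np.zeros(S)
    q = r_worst.copy()
    for _ in range(iters):
        q = r_worst + gamma * (P @ v)     # (S, A) Bellman backup
        v = q.max(axis=1)
    return q.argmax(axis=1), v
```

Entries the defender suspects of poisoning get a large eps, so the plan stops relying on their inflated rewards; entries with eps = 0 are trusted as-is.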
6. Empirical Protocols and Quantitative Results
Empirical studies in (Rathbun et al., 2024, Schott et al., 12 Jan 2026) evaluate attack and defense across diverse environments:
| Method | Attack Success Rate (ASR) | Benign Return Ratio (BRR) |
|---|---|---|
| SleeperNets | 100% | ≥96.5% |
| TrojDRL-W | 57–99% | 26.6–100% |
| BadRL-M | 0–100% | 70–100% |
- SleeperNets achieves perfect attack success and near-perfect return preservation across tasks, often with minimal poisoning budgets.
- Adaptive robustness attacks demonstrate, e.g., on HalfCheetah-v5, that an intermediate α yields robust returns across a range of adversarial radii without collapse of nominal performance; fixed-radius baselines lack such generalization (Schott et al., 12 Jan 2026).
Robustness profiles support the trade-off: an intermediate α delivers substantial distributional robustness with controlled nominal return loss.
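The two headline metrics in the table can be computed from episode logs as follows (the log format and helper names are illustrative):

```python
def attack_success_rate(episodes, target_action):
    """Fraction of triggered time steps on which the victim took the
    adversary's target action. Each episode is a list of
    (triggered: bool, action) pairs."""
    hits = [(a == target_action) for ep in episodes
            for trig, a in ep if trig]
    return sum(hits) / len(hits) if hits else 0.0

def benign_return_ratio(clean_returns, poisoned_returns):
    """Mean poisoned-policy return on benign (trigger-free) episodes
    divided by the mean clean-policy return: the empirical alpha."""
    mean_poisoned = sum(poisoned_returns) / len(poisoned_returns)
    mean_clean = sum(clean_returns) / len(clean_returns)
    return mean_poisoned / mean_clean
```

A BRR of 1.0 means the attack is perfectly stealthy by this metric; an ASR of 1.0 means the trigger always elicits the target action.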
7. Strengths, Limitations, and Research Directions
α-reward-preserving attacks offer:
- Provable stealth: Guaranteed preservation of reward performance for arbitrarily strong backdoors or perturbations.
- Universality: Applicability across MDPs, observation spaces, and RL architectures.
- Efficient resource use: the poisoning budget can be annealed to extremely low values.
Limitations include the potential detectability of large-magnitude reward perturbations via reward-signal monitoring, and the need for threat models granting access to full episode data. Assumptions on trigger function invertibility or support disjointness may restrict practical scope.
Open questions and future extensions include:
- Development of monitoring defenses sensitive to reward distribution offsets.
- Integration of gradient-based trigger optimization for further stealth and efficiency.
- Generalization of -reward-preserving principles to multi-agent and cooperative RL.
- Exploration of partial reward preservation (smaller α) for enhanced stealth under weaker constraints.
- Relaxation of invertibility and disjoint support in trigger generation.
These considerations continue to drive research into both the precision of adversarial attack design and principled defense in reinforcement learning.