
α-Reward-Preserving Attacks in RL

Updated 19 January 2026
  • α-Reward-Preserving Attacks are adversarial interventions in reinforcement learning designed to manipulate reward signals while maintaining an α fraction of the policy’s nominal return.
  • These attacks leverage techniques like dynamic reward poisoning and gradient-based perturbations to balance stealth with effective control over agent behavior.
  • Empirical studies show high success rates and near-optimal return preservation, highlighting the critical need for robust defense mechanisms in RL systems.

The concept of α-reward-preserving attacks in reinforcement learning (RL) defines a class of adversarial interventions designed to induce a specified malicious behaviour, such as the forced execution of an action upon presentation of a trigger, while concurrently maintaining the policy's nominal return above an α fraction of its unpoisoned counterpart. These attacks are distinguished by their combination of stealth (high reward preservation, α → 1) and potent control over victim agent behaviour. Research spanning backdoor poisoning, robust RL, and reward-perturbation optimization has developed precise theoretical, algorithmic, and empirical frameworks for both attack and defense, as documented in "SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents" (Rathbun et al., 2024), "Reward-Preserving Attacks For Robust Reinforcement Learning" (Schott et al., 12 Jan 2026), and "Defense Against Reward Poisoning Attacks in Reinforcement Learning" (Banihashem et al., 2021).

1. Formal Definitions and Threat Model

At its core, an α-reward-preserving attack concerns an RL agent trained on a Markov decision process (MDP) $M = (S, A, T, R, \gamma)$, comprising a state space $S$, action set $A$, transition kernel $T$, reward function $R$, and discount factor $\gamma \in [0,1)$. Let $\pi_{clean}$ denote the policy trained without intervention, yielding expected discounted return $J(\pi_{clean})$. An adversarially trained policy $\pi_{poison}$ under a backdoor or perturbation mechanism is said to be α-reward-preserving if $J(\pi_{poison}) \geq \alpha \cdot J(\pi_{clean})$, i.e., the poisoned agent achieves at least an α fraction of the original return.
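As a concrete illustration of the definition, the condition $J(\pi_{poison}) \geq \alpha \cdot J(\pi_{clean})$ can be checked by iterative policy evaluation. The two-state MDP, both policies, and the value α = 0.85 below are hypothetical, invented purely for this sketch:

```python
# Toy illustration of the alpha-reward-preservation condition. The MDP,
# policies, and alpha are hypothetical, not taken from the cited papers.

GAMMA = 0.9

# P[s][a][s2]: transition probabilities; R[s][a]: rewards.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]

def policy_return(pi, mu0=(1.0, 0.0), iters=2000):
    """Expected discounted return J(pi) from start distribution mu0,
    computed by fixed-point iteration on the Bellman equation."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [sum(pi[s][a] * (R[s][a] + GAMMA *
                 sum(P[s][a][s2] * V[s2] for s2 in range(2)))
                 for a in range(2))
             for s in range(2)]
    return sum(mu0[s] * V[s] for s in range(2))

pi_clean  = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical clean policy
pi_poison = [[0.2, 0.8], [0.0, 1.0]]   # slightly degraded poisoned policy

alpha = 0.85
J_clean, J_poison = policy_return(pi_clean), policy_return(pi_poison)
print(J_poison >= alpha * J_clean)  # → True
```

With these numbers the poisoned policy loses roughly 10% of the clean return, so it satisfies α = 0.85 preservation while remaining strictly suboptimal.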

In backdoor threat models (Rathbun et al., 2024), the adversary defines a trigger function $\delta: S \rightarrow S_p$ that maps benign states to poisoned versions. Poisoning is achieved by injecting $\delta(s)$ during training with budget β, alongside reward manipulation through a dynamic function $\Delta R$. In reward-perturbation settings (Schott et al., 12 Jan 2026; Banihashem et al., 2021), the attacker solves a constrained optimization to make a target policy $\pi^\dagger$ uniquely optimal under an $\ell_2$-distance-minimizing modification of $R$, enforcing a margin α over all neighbouring policies.

This threat model introduces both a measure of attack strength (progress towards the adversary’s goals) and stealth (avoidance of detection through performance metrics).

2. Attack Construction and Optimization Strategies

The adversary’s optimization problem formalizes two competing objectives:

  1. Attack Success: With high probability, induce a specified adversarial outcome, e.g., force action $a^+$ whenever the trigger is present.
  2. Reward Preservation: Maintain episodic or per-state return within an α fraction of the nominal baseline, minimizing the risk of detection or performance degradation.

This dual objective is typically cast as

$$\max_{\text{MDP}' = (S \cup S_p,\, A,\, T',\, R')} \; \mathbb{E}_{s \sim d,\; \pi' \sim L(\text{MDP}')}\!\left[\pi'(\delta(s), a^+)\right] \quad \text{s.t.} \quad J_M(\pi') \geq \alpha\, J_M(\pi_{clean})$$

where $L(\text{MDP}')$ denotes the learning algorithm (e.g., PPO) and $J_M$ evaluates return on the original MDP $M$.

In "SleeperNets" (Rathbun et al., 2024), the reward-poisoning function $\Delta R$ is chosen dynamically such that:

  • The Bellman backup at benign states is preserved exactly.
  • At poisoned states, the reward immediately and exclusively favours $a^+$, indifferent to downstream value.
  • The overall optimal policy in the modified MDP coincides with an α-optimal policy for the benign task.

In robust RL settings (Schott et al., 12 Jan 2026), attacks are tuned per state and action by decomposing the perturbation into a gradient-derived direction $A(s,a)$ and a magnitude $\eta(s,a)$, maximizing adversarial impact subject to the per-state constraint

$$Q^{*, \Omega^{\xi}}(s,a) \geq Q^{*, \Omega^{\xi^*}}(s,a) + \alpha \left[ Q^{*, \Omega}(s,a) - Q^{*, \Omega^{\xi^*}}(s,a) \right]$$

where $\Omega^{\xi^*}$ denotes the worst-case adversary and $Q^{*, \Omega}(s,a)$ the nominal Q-value.

3. Algorithmic Realizations

Attacks are implemented via an episode-level procedure:

  • Sample a trajectory $H$ under the current policy $\pi$.
  • Randomly select a fraction β of the transitions in $H$ for poisoning.
  • For each poisoned transition $(s_t, a_t, r_t)$:
    • Apply the trigger: $s_t \leftarrow \delta(s_t)$.
    • Poison the reward: $r_t \leftarrow 1_c[a_t = a^+] - \alpha \gamma \hat{V}(s_{t+1})$.
    • Retroactively adjust $r_{t-1}$ to preserve the Bellman backup:

      $r_{t-1} \leftarrow r_{t-1} - \gamma r_t + \gamma \hat{V}(s_t)$

  • Update the replay buffer and optimize the policy as usual.

Notably, retroactive reward correction ensures that benign transitions remain unaffected in expected value, maintaining stealth.
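The steps above can be sketched as follows. The trigger, the value estimate $\hat{V}$, the constants, and the trajectory format are hypothetical stand-ins chosen for this illustration, not the actual SleeperNets implementation, which runs inside the victim's training loop:

```python
import random

ALPHA, GAMMA, BETA = 0.95, 0.99, 0.1   # preservation level, discount, budget
A_PLUS = 1                             # adversary's target action (hypothetical)

def trigger(state):
    """Toy trigger delta: tag the state to mark it as poisoned."""
    return ("poisoned", state)

def V_hat(state):
    """Stand-in estimate of the benign value function."""
    return 1.0

def poison_episode(trajectory):
    """Poison a fraction BETA of the [state, action, reward] steps of one episode."""
    traj = [list(step) for step in trajectory]
    n_poison = max(1, int(BETA * len(traj)))
    for t in sorted(random.sample(range(len(traj)), n_poison)):
        s, a, _ = traj[t]
        s_next = traj[t + 1][0] if t + 1 < len(traj) else None
        traj[t][0] = trigger(s)                           # apply the trigger
        bootstrap = 0.0 if s_next is None else V_hat(s_next)
        traj[t][2] = (1.0 if a == A_PLUS else 0.0) - ALPHA * GAMMA * bootstrap
        if t > 0:  # retroactive correction preserves the benign Bellman backup
            traj[t - 1][2] += -GAMMA * traj[t][2] + GAMMA * V_hat(s)
    return traj

episode = [[i, random.choice([0, 1]), 0.1] for i in range(20)]
poisoned = poison_episode(episode)
print(sum(isinstance(s, tuple) for s, _, _ in poisoned))  # prints 2 here
```

With a 20-step episode and β = 0.1, exactly two transitions receive the trigger; their predecessors' rewards absorb the correction term so benign backups are unchanged in expectation.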

Deep RL attacks select a unit-norm direction $A(s,a)$ via the maximal negative gradient of the critic, and search over candidate perturbation magnitudes η:

  • Compute $Q_0 = Q_{\psi_c}((s,a), \eta_B)$ and $Q_1 = Q_{\psi_c}((s,a), 0)$.
  • Form $\widehat{Q}_\alpha(s,a) = Q_0 + \alpha (Q_1 - Q_0)$.
  • Search for the maximal η such that $Q_{\psi_\alpha}((s,a), \eta) \geq \widehat{Q}_\alpha(s,a)$.
  • Apply the perturbation $\eta^* A(s,a)$.

Joint critic networks $Q_{\psi_\alpha}$ (dynamic) and $Q_{\psi_c}$ (static) are trained off-policy to evaluate robustness across radii and actions.
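Assuming a critic that decays monotonically with the perturbation radius (a stub standing in for the learned $Q_{\psi_\alpha}$ and $Q_{\psi_c}$ networks, which are trained in the paper), the magnitude search can be sketched as a simple bisection:

```python
# Sketch of the perturbation-magnitude search. The critic, budget ETA_B,
# and alpha below are hypothetical stand-ins for the learned components.

ALPHA = 0.7        # fraction of the nominal-to-worst-case gap to preserve
ETA_B = 1.0        # maximal admissible perturbation radius (assumed budget)

def Q_critic(eta):
    """Stub critic: Q-value decays linearly as the perturbation radius grows."""
    return 10.0 - 6.0 * eta

def eta_star(tol=1e-6):
    """Largest eta with Q(eta) >= Q_hat_alpha = Q0 + alpha * (Q1 - Q0)."""
    q0, q1 = Q_critic(ETA_B), Q_critic(0.0)   # worst-case and nominal values
    q_hat = q0 + ALPHA * (q1 - q0)            # interpolated robustness target
    if Q_critic(ETA_B) >= q_hat:              # the whole budget already satisfies it
        return ETA_B
    lo, hi = 0.0, ETA_B
    while hi - lo > tol:                      # bisection on the monotone critic
        mid = 0.5 * (lo + hi)
        if Q_critic(mid) >= q_hat:
            lo = mid
        else:
            hi = mid
    return lo

print(round(eta_star(), 3))  # → 0.3
```

With this stub, $\widehat{Q}_\alpha = 4 + 0.7 \cdot 6 = 8.2$, so the search settles on η* ≈ 0.3; the applied perturbation is then $\eta^* A(s,a)$.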

4. Theoretical Guarantees and Analytical Characterization

Theoretical analysis in both backdoor and robust settings delivers the following guarantees:

  • Backdoor Success and Stealth (SleeperNets): Theorems establish, under capacity and data-sufficiency assumptions, that the optimal poisoned policy will (i) choose $a^+$ with probability one on triggered states, and (ii) match the clean policy's return on benign states, achieving α = 1 exactly (Rathbun et al., 2024).
  • Per-State Robustness Constraint (Reward-Preserving): For any $(s,a)$, the per-state or per-action return under attack is bounded below by $V^{worst}(s) + \alpha \Delta(s)$, preserving an α fraction of the nominal-to-worst-case gap (Schott et al., 12 Jan 2026).

Additional properties include reward-structure preservation (the Q-value ordering remains unchanged at the extremes) and preference-shift quantification, determining when action-selection orderings reverse as a function of adversary strength and α.
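A preference shift can be made concrete with the interpolated value $\widehat{Q}_\alpha(s,a) = Q_0 + \alpha (Q_1 - Q_0)$ from Section 3. The Q-values below are hypothetical numbers, chosen so that the nominally preferred action loses under the worst-case adversary:

```python
# Illustrative (hypothetical) numbers: action a1 is better nominally,
# action a2 is better under the worst-case adversary. The interpolation
# Q_hat_alpha = Q_worst + alpha * (Q_nominal - Q_worst) then reverses the
# action ordering at a critical alpha.

q_nom = {"a1": 5.0, "a2": 4.0}   # nominal Q-values (a1 preferred)
q_wst = {"a1": 1.0, "a2": 3.0}   # worst-case Q-values (a2 preferred)

def q_hat(a, alpha):
    return q_wst[a] + alpha * (q_nom[a] - q_wst[a])

# Critical alpha where the interpolated values cross:
# 1 + 4*alpha = 3 + 1*alpha  ->  alpha = 2/3
alpha_crit = (q_wst["a2"] - q_wst["a1"]) / (
    (q_nom["a1"] - q_wst["a1"]) - (q_nom["a2"] - q_wst["a2"]))

print(round(alpha_crit, 3))                 # → 0.667
print(q_hat("a1", 0.5) < q_hat("a2", 0.5))  # below: a2 preferred → True
print(q_hat("a1", 0.8) > q_hat("a2", 0.8))  # above: a1 preferred → True
```

Below the critical α the agent defers to the worst-case ordering; above it, the nominal ordering is restored, which is exactly the ordering-reversal phenomenon quantified in the analysis.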

5. Defense Strategies and Performance Guarantees

Defensive frameworks against α-reward-preserving attacks, as established in (Banihashem et al., 2021), rely on robust optimization over occupancy measures:

  • LP Formulation: The agent optimizes over all policies to maximize worst-case return under the poisoned reward vector $\hat{R}$, subject to a known or bounded attack parameter α.
  • Certificates: For any defense policy $\pi_D$ so derived, the worst-case return on the true reward $R$ does not fall below its performance on $\hat{R}$, and its suboptimality gap relative to the attacker's target policy $\pi^\dagger$ is tightly upper-bounded.
  • Extensions: When α is unknown, substituting an upper bound $\bar{\alpha}$ preserves the guarantees as long as it over-estimates the true value; under-estimating α can cause the defense to collapse to the attacker's target policy.

These optimization programs are tractable and yield nontrivial guarantees of robustness against stealthy reward-poisoning attacks.
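A minimal sketch of the robust-selection idea, assuming a small finite candidate set and an $\ell_2$-bounded poisoning budget ε (assumptions made for this sketch; the cited defense solves a full LP over all occupancy measures). The closed form $\min_{\|\Delta R\|_2 \le \varepsilon} \rho^\top (\hat{R} + \Delta R) = \rho^\top \hat{R} - \varepsilon \|\rho\|_2$ follows from the Cauchy-Schwarz inequality:

```python
import math

# Hedged sketch: robust policy selection under a poisoned reward R_hat with
# an assumed l2 poisoning budget EPS. MDP, candidates, and EPS are hypothetical.

GAMMA, EPS = 0.9, 0.5
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.9, 0.1]]]
R_hat = [[0.0, 1.0],
         [0.0, 2.0]]            # observed (possibly poisoned) rewards
MU0 = (1.0, 0.0)                # start-state distribution

def occupancy(pi, iters=2000):
    """Normalized discounted state-action occupancy rho(s,a) of policy pi."""
    d = list(MU0)
    for _ in range(iters):
        d = [(1 - GAMMA) * MU0[s2] + GAMMA *
             sum(d[s] * pi[s][a] * P[s][a][s2]
                 for s in range(2) for a in range(2))
             for s2 in range(2)]
    return [[d[s] * pi[s][a] for a in range(2)] for s in range(2)]

def worst_case_return(pi):
    """min over ||dR||_2 <= EPS of rho.(R_hat + dR) = rho.R_hat - EPS*||rho||_2."""
    rho = occupancy(pi)
    mean = sum(rho[s][a] * R_hat[s][a] for s in range(2) for a in range(2))
    norm = math.sqrt(sum(rho[s][a] ** 2 for s in range(2) for a in range(2)))
    return mean - EPS * norm

candidates = [
    [[0.0, 1.0], [0.0, 1.0]],   # always action 1
    [[1.0, 0.0], [0.0, 1.0]],   # action 0 in state 0
    [[0.5, 0.5], [0.5, 0.5]],   # uniform
]
best = max(candidates, key=worst_case_return)
print(candidates.index(best))   # → 0: the always-action-1 policy wins here
```

Restricting to three candidates keeps the sketch elementary; the actual defense searches the full occupancy-measure polytope with an LP and thereby certifies the bound for every policy at once.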

6. Empirical Protocols and Quantitative Results

Empirical studies (Rathbun et al., 2024; Schott et al., 12 Jan 2026) evaluate attacks and defenses across diverse environments:

Method        Attack Success Rate (ASR)   Benign Return Ratio (BRR)
SleeperNets   100%                        ≥ 96.5%
TrojDRL-W     57–99%                      26.6–100%
BadRL-M       0–100%                      70–100%
  • SleeperNets achieves perfect attack success and near-perfect return preservation across tasks, often with poisoning budgets β < 0.001%.
  • Adaptive Robustness Attacks demonstrate, e.g., on HalfCheetah-v5, that α ≈ 0.7 yields robust returns across adversarial radii (η) without nominal collapse; fixed-radius baselines lack such generalization (Schott et al., 12 Jan 2026).

Robustness profiles support the trade-off: intermediate αα delivers substantial distributional robustness with controlled nominal return loss.

7. Strengths, Limitations, and Research Directions

α-reward-preserving attacks offer:

  • Provable stealth: Guaranteed preservation of reward performance for arbitrarily strong backdoors or perturbations.
  • Universality: Applicability across MDPs, observation spaces, and RL architectures.
  • Efficient resource use: The ability to anneal poisoning budgets (β) to extremely low values.

Limitations include the potential detectability of large-magnitude reward perturbations via reward-signal monitoring, and the requirement that the threat model grant access to full episode data. Assumptions of trigger-function invertibility or support disjointness may further restrict practical scope.

Open questions and future extensions include:

  • Development of monitoring defenses sensitive to reward distribution offsets.
  • Integration of gradient-based trigger optimization for further stealth and efficiency.
  • Generalization of α-reward-preserving principles to multi-agent and cooperative RL.
  • Exploration of partial α for enhanced stealth under weaker constraints.
  • Relaxation of invertibility and disjoint support in trigger generation.

These considerations continue to drive research into both the precision of adversarial attack design and principled defense in reinforcement learning.
