
Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?

Published 7 Jun 2025 in cs.LG and cs.CR | arXiv:2506.06891v1

Abstract: We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.

Summary

  • The paper proposes AT-DPT, an adversarial training protocol that lets a Decision-Pretrained Transformer recover near-optimal behavior under reward poisoning attacks.
  • Extensive evaluations show that AT-DPT achieves lower cumulative regret than standard and robust bandit baselines under both learned and adaptive attackers.
  • The robustness observed in bandit settings generalizes to more complex MDP environments, strengthening the security of in-context RL.

Analysis of Corruption-Robustness in In-Context Reinforcement Learning

The paper "Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?" presents a comprehensive study of the vulnerability and robustness of transformer-based decision-making models to reward poisoning attacks in reinforcement learning (RL). Reward poisoning, an adversarial attack that corrupts the reward signal during learning, can fundamentally alter the course of learning in RL systems, including those built on the Decision-Pretrained Transformer (DPT). The study introduces the Adversarially Trained Decision-Pretrained Transformer (AT-DPT) to address this vulnerability.
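To make the threat model concrete, a reward poisoning attack in a bandit setting can be sketched as a bounded perturbation of the observed rewards. The function below is an illustrative example, not the attack model from the paper; `target_arm` and `budget` are assumed parameters of this sketch.

```python
import numpy as np

def poison_rewards(rewards, arms, target_arm, budget=1.0):
    """Illustrative reward-poisoning attack on bandit feedback.

    Shifts each observed reward by a bounded perturbation so that the
    attacker's target arm looks better and every other arm looks worse.
    `budget` caps the per-step perturbation magnitude.
    """
    poisoned = rewards.copy()
    for t, a in enumerate(arms):
        if a == target_arm:
            poisoned[t] += budget   # inflate the target arm
        else:
            poisoned[t] -= budget   # deflate all other arms
    return poisoned
```

A learner that trusts the poisoned feedback will be steered toward `target_arm` even when it is suboptimal, which is exactly the failure mode AT-DPT is trained to resist.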

Methodology and Contributions

The core contribution of the paper is the development of a robust adversarial training protocol for in-context reinforcement learning systems. The authors propose AT-DPT, which integrates adversarial training mechanisms to enhance model resilience against reward poisoning:

  1. Adversarial Training Framework: AT-DPT simultaneously trains an attacker, which poisons environment rewards to minimize the model's true reward, and the DPT model, which learns to infer optimal actions from the poisoned data.
  2. Evaluation Against Baselines: The authors benchmark AT-DPT against standard bandit algorithms, including robust variants designed to handle reward contamination, under both learned and adaptive attacker scenarios.
  3. Generalization to Complex Environments: Beyond the bandit setting, AT-DPT is evaluated in MDP scenarios, demonstrating robustness to poisoning in more complex environments.
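The min-max structure of step 1 can be sketched in a toy form. Here a simple empirical-mean rule stands in for the DPT, and the attacker is any function that perturbs the context rewards; all names and components are illustrative assumptions, not the paper's actual architecture or training code.

```python
import numpy as np

def empirical_best_arm(arms, rewards, k):
    """Toy stand-in for the DPT: choose the arm with the highest
    empirical mean reward in the (possibly poisoned) context."""
    means = np.full(k, -np.inf)
    for a in range(k):
        mask = arms == a
        if mask.any():
            means[a] = rewards[mask].mean()
    return int(np.argmax(means))

def adversarial_round(true_means, policy, attacker, n_context=50, seed=0):
    """One schematic round of the min-max game: the attacker poisons
    the context rewards, the policy infers an action from the poisoned
    context, and the payoff is the policy's TRUE expected reward
    (which the attacker minimizes and the learner maximizes)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(true_means)
    arms = rng.integers(0, len(mu), size=n_context)
    rewards = rng.normal(mu[arms], 0.05)
    action = policy(arms, attacker(arms, rewards), len(mu))
    return float(mu[action])
```

For example, a sign-flipping attacker (`lambda arms, r: -r`) steers the empirical-mean policy away from the best arm, whereas under no attack the policy's true reward stays at the optimum; AT-DPT's training alternates between improving the attacker and retraining the model on the resulting poisoned contexts.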

Results

The numerical results underline the efficacy of AT-DPT in recovering optimal strategies from contaminated reward signals in a variety of settings:

  • Lower Regret Under Attack: AT-DPT consistently achieves lower cumulative regret than the baselines, including robust bandit algorithms, in adversarially perturbed environments.
  • Adaptive Attack Robustness: The model maintains superior performance even when pitted against adaptive adversaries, showcasing its ability to recover from varied attack strategies.
  • Robust Extension to MDPs: Results indicate that the robustness demonstrated in bandit scenarios generalizes effectively to more complex MDP environments, offering broad applicability.
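Cumulative regret, the metric used to compare the methods above, is the running sum of gaps between the best arm's mean reward and the mean reward of each chosen arm. A minimal sketch:

```python
import numpy as np

def cumulative_regret(true_means, actions):
    """Cumulative regret: at each round, add the gap between the best
    arm's mean reward and the chosen arm's mean reward."""
    mu = np.asarray(true_means)
    gaps = mu.max() - mu[np.asarray(actions)]
    return np.cumsum(gaps)
```

A method that recovers the optimal arm despite poisoning incurs zero per-round regret from that point on, so its cumulative regret curve flattens; the paper's comparisons are in terms of how quickly and how low these curves flatten under attack.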

Implications and Future Directions

The presented work has practical implications for deploying RL systems in real-world settings where adversarial interference is a realistic threat:

  • Improving Security in RL: The adversarial training methodology can be instrumental in fortifying RL systems against various forms of data and reward contamination, enhancing safe and reliable deployment in sensitive applications.
  • Further Exploration of Adaptive Methods: Given the successful integration of adaptive strategies, future research may explore more sophisticated adaptive algorithms that continuously improve against evolving adversarial tactics.
  • Expansion to Other In-Context Learning Domains: The concept could be extended to investigate robustness in other in-context learning domains where transformers are leveraged, creating opportunities for cross-domain advancements in AI robustness.

In conclusion, the paper provides a methodologically sound and empirically validated approach to enhancing the robustness of RL systems against reward poisoning. The introduction of AT-DPT marks a significant stride in addressing a critical security challenge, opening pathways for further exploration and development in adversarially robust AI systems.
