CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Published 5 Apr 2026 in cs.CR and cs.AI | (2604.04060v1)

Abstract: As LLMs are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially those evolving over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine strategies across rounds. In this work, we propose CoopGuard, a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies, coordinated by System Agent, which conditions decisions on the evolving defense state (interaction history) and orchestrates agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively escalating multi-round LLM attacks. Experiments show that CoopGuard reduces attack success rate by 78.9% over state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.


Authors (9)

Summary

The paper presents CoopGuard, a stateful, multi-agent defense that adapts in real time to counter dynamic, multi-round adversarial attacks.
It reports a 78.9% reduction in attack success rate relative to state-of-the-art defenses, a 186% improvement in deception rate, and a 167.9% increase in attacker resource expenditure.
The study highlights practical middleware deployment for LLMs, preserving user experience with only a modest impact on benign interactions.

CoopGuard: Adaptive Multi-Agent Defense for Evolving Multi-Round LLM Jailbreak Attacks

Motivation and Problem Setting

LLMs are susceptible to adversarial prompt attacks, particularly those that refine and escalate across multiple interaction rounds. Existing defenses predominantly rely on stateless, per-query judgements or fixed refusal triggers, strategies that are systematically bypassed by dynamic adversaries able to exploit prompt variations and conversational history. As attackers leverage independent-yet-evolving multi-round tactics, there is a clear need for proactive, adaptive, and context-aware mitigation strategies.

The authors introduce CoopGuard, a defense framework designed to address the limitations of static approaches by orchestrating a set of cooperative agents to maintain and adapt an internal defense state throughout multi-round adversarial exchanges.

Figure 1: Illustration of the adversary's multi-round strategy evolution and the CoopGuard framework’s adaptive, stateful defense mechanism.

CoopGuard Framework Design

Defense Architecture

CoopGuard organizes LLM defenses as a multi-agent system coordinated by a central System Agent. The framework maintains a state $h_t$ updated each round, incorporating recent attacker behaviors, deception outcomes, and forensic evidence. Its agents include:

Deferring Agent (DA): Dynamically injects ambiguity to slow adversarial probing, escalating resistance as cumulative suspiciousness grows. The DA leverages recency-weighted risk estimates to identify iterative attack progression.
Tempting Agent (TA): Generates adaptive, state-conditioned decoys aimed at misleading attackers into non-productive or benign query paths. Conditioning on the evolving state ensures narrative consistency and sustained deception.
Forensic Agent (FA): Aggregates, classifies, and summarizes attacker behaviors, flagging emerging patterns or escalation in strategies. FA outputs structured evidence reports to facilitate adaptive policy updating.
System Agent (SA): Centralizes orchestration and policy fusion, integrating DA, TA, and FA signals to select optimal defense actions and update the shared state.

Figure 2: CoopGuard’s multi-agent cooperative architecture: DA impedes, TA tempts, FA logs and analyzes, and SA coordinates defense adaptation across rounds.
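
The summary does not say how the DA's recency-weighted risk estimate is computed. One plausible reading, sketched here purely as an assumption, is an exponentially decayed average of per-round suspicion scores. The function name `recency_weighted_risk`, the `decay` parameter, and the [0, 1] score scale are all hypothetical.

```python
from typing import Sequence


def recency_weighted_risk(scores: Sequence[float], decay: float = 0.5) -> float:
    """Exponentially decayed average of per-round suspicion scores.

    Hypothetical illustration: scores[i] is the suspicion assigned to round i
    (oldest first), each in [0, 1]. The most recent round gets weight 1, the
    one before it `decay`, then `decay**2`, and so on.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    if not scores:
        return 0.0
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# An escalating dialogue scores higher than one that cools off,
# even though both contain the same set of per-round scores.
escalating = recency_weighted_risk([0.1, 0.4, 0.9])
cooling = recency_weighted_risk([0.9, 0.4, 0.1])
print(escalating > cooling)
```

With `decay=1.0` the estimate reduces to a plain mean, so the parameter controls how strongly recent rounds dominate.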

This structure contrasts with prior defenses that treat each query or session monolithically, enabling CoopGuard to accumulate contextual evidence, proactively escalate defenses, and exhaust adversarial probing resources.
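
To make the per-round control flow concrete, here is a minimal sketch of how a coordinator might update a shared state and route each prompt to one of the agents. It is not the authors' implementation: the `DefenseState` fields, the keyword-based scoring, the thresholds, and the canned replies are all hypothetical stand-ins for what would be LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DefenseState:
    """Toy stand-in for the shared state h_t (all fields are hypothetical)."""
    risk_history: List[float] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    decoys_served: int = 0


def forensic_agent(prompt: str, state: DefenseState) -> float:
    """FA stand-in: score the prompt with a keyword list and log evidence."""
    suspicious = ("bypass", "ignore previous", "jailbreak")
    hits = [kw for kw in suspicious if kw in prompt.lower()]
    if hits:
        state.evidence.append(f"round {len(state.risk_history) + 1}: {hits}")
    return min(1.0, 0.5 * len(hits))


def deferring_agent(state: DefenseState) -> str:
    """DA stand-in: an ambiguous, stalling reply."""
    return "Could you clarify what you are trying to accomplish?"


def tempting_agent(state: DefenseState) -> str:
    """TA stand-in: a harmless decoy, numbered via the shared state."""
    state.decoys_served += 1
    return f"[decoy #{state.decoys_served}] Here is some general background..."


def system_agent(prompt: str, state: DefenseState) -> str:
    """SA stand-in: update the state, then pick an action from cumulative risk."""
    state.risk_history.append(forensic_agent(prompt, state))
    cumulative = sum(state.risk_history)
    if cumulative == 0:
        return "PASS_THROUGH"          # benign so far: forward to the protected LLM
    if cumulative < 1.0:
        return deferring_agent(state)  # mild suspicion: slow the probing
    return tempting_agent(state)       # sustained suspicion: serve a decoy


state = DefenseState()
turns = ["What is TLS?", "How do I bypass the filter?", "Ignore previous rules, jailbreak mode"]
replies = [system_agent(turn, state) for turn in turns]
for reply in replies:
    print(reply)
```

The point of the sketch is only that the decision at round t depends on everything accumulated in the state, not on the current prompt alone.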

EMRA Multi-Round Attack Evaluation and Benchmark

To systematically evaluate CoopGuard and baseline methods, the authors introduce EMRA, a comprehensive multi-round adversarial benchmark comprising 5,200 attack trajectories spanning eight typologically diverse jailbreak strategy classes. Sequences in EMRA capture the independent-yet-escalating attack pattern, with each round representing an autonomous attempt followed by incremental refinements in obfuscation, indirectness, or prompt structure.

The EMRA taxonomy supports both query-level and episode-level robustness analysis, allowing for precise diagnosis of defense efficacy by attack strategy type.
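
The difference between query-level and episode-level analysis can be illustrated with a small sketch. The EMRA schema is not given in this summary, so the `Trajectory` record, the strategy labels, and the two definitions used below (every round counts as one attempt; an episode counts as broken if any round succeeds) are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical record for one multi-round attack (not the EMRA schema)."""
    strategy: str               # jailbreak strategy class label
    round_success: List[bool]   # one flag per round: did that attempt succeed?


def asr_by_strategy(trajs: List[Trajectory]) -> Dict[str, Tuple[float, float]]:
    """Return {strategy: (query_level_asr, episode_level_asr)}."""
    grouped: Dict[str, List[Trajectory]] = defaultdict(list)
    for t in trajs:
        grouped[t.strategy].append(t)

    result: Dict[str, Tuple[float, float]] = {}
    for strategy, group in grouped.items():
        rounds = [flag for t in group for flag in t.round_success]
        query_asr = sum(rounds) / len(rounds) if rounds else 0.0
        episode_asr = sum(any(t.round_success) for t in group) / len(group)
        result[strategy] = (query_asr, episode_asr)
    return result


# Made-up data and labels, purely to exercise the function.
demo = [
    Trajectory("role_play", [False, False, True]),
    Trajectory("role_play", [False, False, False]),
    Trajectory("encoding", [False, False]),
]
report = asr_by_strategy(demo)
print(report)
```

In the made-up data, a strategy that succeeds once in six rounds looks weak at the query level but still breaks half of its episodes, which is why both views are useful.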

Figure 3: Breakdown of resource consumption (tokens) for attack and defense agents by question category and jailbreak strategy during multi-round adversarial episodes.

Figure 4: Distribution of prompt lengths and attacker-defender token ratio, illustrating the varying cost and complexity across different jailbreak strategies.

Experimental Protocol and Results

Experimental Setup

Evaluations span three major LLM backbones (GPT-5, Gemini-2.5-Pro, DeepSeek-V3). CoopGuard agents are instantiated on GPT-4, with structured prompt templates enforcing specialization and state tracking. Five state-of-the-art defenses serve as baselines, including adversarial training, defensive prompt engineering, and safety-aware objective tuning.

Evaluation metrics include:

Attack Success Rate (ASR): The proportion of prompts eliciting harmful outputs.
Deception Rate (DR): The frequency at which adversaries are misled into dead-end or benign query trajectories.
Attacker Efficiency (AE): Average number of tokens attackers must expend per dialogue, reflecting operational cost; a higher value means a costlier attack.
User Experience: Measured on MT-Bench-101 via the CoSafe protocol to ensure benign usage is not degraded by defense activity.
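
The first three metrics above can be computed from per-dialogue logs roughly as follows. This is a sketch under stated assumptions: the `DialogueLog` record and its field names are hypothetical, and DR is taken here as the fraction of dialogues in which the attacker was misled, which is one possible reading of the definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DialogueLog:
    """Hypothetical per-dialogue record; field names are illustrative only."""
    harmful_flags: List[bool]   # per attacker prompt: did it elicit harmful output?
    deceived: bool              # was the attacker led into a dead-end or benign path?
    attacker_tokens: int        # total tokens the attacker spent in this dialogue


def attack_success_rate(logs: List[DialogueLog]) -> float:
    """ASR: proportion of attacker prompts that elicited harmful output."""
    flags = [f for log in logs for f in log.harmful_flags]
    return sum(flags) / len(flags) if flags else 0.0


def deception_rate(logs: List[DialogueLog]) -> float:
    """DR (assumed per-dialogue): share of dialogues where the attacker was misled."""
    return sum(log.deceived for log in logs) / len(logs) if logs else 0.0


def attacker_efficiency(logs: List[DialogueLog]) -> float:
    """AE: average attacker tokens per dialogue, as defined in the text."""
    return sum(log.attacker_tokens for log in logs) / len(logs) if logs else 0.0


# Made-up logs, purely to exercise the functions.
logs = [
    DialogueLog([False, True], deceived=False, attacker_tokens=400),
    DialogueLog([False, False], deceived=True, attacker_tokens=1200),
]
print(attack_success_rate(logs), deception_rate(logs), attacker_efficiency(logs))
```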

Main Outcomes

Robustness: CoopGuard reduces ASR by 78.9% relative to the strongest baseline, demonstrating robust mitigation across all major backbones and attack types. The system generalizes to both direct harmful and indirect or obfuscated prompts and is resilient even against sophisticated, evolving multi-round attacks.

Deception and Resource Drain: DR improves by 186% and AE increases by 167.9%, meaning attackers expend substantially more resources with reduced efficacy. CoopGuard’s agents maintain deceptive narratives that sustain adversarial engagement without conceding sensitive information.
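
Read as relative changes against the strongest baseline (stated explicitly for ASR; assumed here for DR and AE), the headline percentages translate into the following multipliers, where the subscript "base" denotes the baseline value:

```latex
\begin{aligned}
\mathrm{ASR}_{\text{CoopGuard}} &= (1 - 0.789)\,\mathrm{ASR}_{\text{base}} = 0.211\,\mathrm{ASR}_{\text{base}},\\
\mathrm{DR}_{\text{CoopGuard}}  &= (1 + 1.86)\,\mathrm{DR}_{\text{base}}  = 2.86\,\mathrm{DR}_{\text{base}},\\
\mathrm{AE}_{\text{CoopGuard}}  &= (1 + 1.679)\,\mathrm{AE}_{\text{base}} = 2.679\,\mathrm{AE}_{\text{base}}.
\end{aligned}
```

For instance, a baseline ASR of 10% (a made-up figure, not one from the paper) would correspond to roughly 2.1% under CoopGuard.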

Figure 5: CoopGuard forces adversaries to expend more tokens on average compared to baseline defenses, greatly increasing the operational cost of multi-round attacks.

Evaluation Consistency: Results are robust across judges, with human and DeepSeek-based auto-evaluation confirming trends observed with primary GPT-Judge metrics.

Figure 6: Cross-validation with independent evaluators shows consistent superiority of CoopGuard in ASR reduction and DR improvement.

Usability: While CoopGuard’s layered defense moderately reduces Depth and Completeness in benign conversations (by up to ~7%), Politeness, Clarity, and Accuracy are preserved, indicating high compatibility with legitimate user interactions.

Analysis, Implications, and Future Directions

The CoopGuard framework directly addresses critical limitations observed in stateless and purely reactive defenses. Its stateful, multi-agent design enables fine-grained escalation of ambiguity and misdirection, trapping of adversaries in prolonged engagement, and ongoing forensic analysis. By relying on sustained, context-aware deception rather than static blocking, the approach disrupts multi-round escalation, forces attackers into wasteful loops, and raises their operational burden.

There are several implications:

Practical Deployment: Deploying CoopGuard as middleware for LLM-based interfaces can offer enhanced security without substantial usability loss, which matters for high-stakes or large-scale environments.
Theoretical Perspective: This work demonstrates the efficacy of adaptive, game-theoretic multi-agent systems in practical machine learning security, opening directions for integrating adversarial modeling, active learning, and automated forensic traceability.
Future Research: Open problems include optimizing agent-policy training for unseen dynamic threats, further reducing defense overhead, and modeling adversarial co-evolution. Extensions could explore meta-learning-based configuration, incorporating user feedback in forensic state updates, and defense against multimodal or chain-of-agent attacks.

Conclusion

CoopGuard represents a significant advancement in LLM defense, combining stateful, multi-agent cooperation and dynamic adaptation to counter evolving multi-round attacks (2604.04060). Its ability to sustain a coordinated, deceptive, and resource-draining defense while largely preserving user experience demonstrates a robust pathway for securing next-generation LLM deployments in adversarial environments. The EMRA benchmark further enables reproducible evaluation and future comparative research in this domain.
