AC-TGPO: Joint Attack-Defense Policy Optimization
- The paper demonstrates that AC-TGPO efficiently optimizes joint attack-defense policies using a dual-MDP framework and tree-aware, group-normalized PPO.
- It introduces an adversarial curriculum that dynamically balances normal, asymmetric, and hard samples with GS-MCTS for escalating challenges.
- Empirical results reveal significant improvements in safety metrics, setting a new benchmark for robust LLM jailbreak defense.
Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO) is a reinforcement learning module designed for joint attack and defense policy training in LLMs, particularly in adversarial settings such as jailbreak attack-defense co-evolution. Introduced as a core component of the ACE-Safety framework, AC-TGPO integrates policy optimization with adversarial curriculum learning, utilizing a tree-aware, group-normalized approach to improve the robustness and mutual advancement of attacker and defender LLMs through joint training on dynamically difficult samples (Li et al., 24 Nov 2025).
1. Formalization of Joint Attack-Defense Optimization
AC-TGPO models the attacker LLM and the defender LLM as parameterized policies within two parallel Markov Decision Processes (MDPs). The attack policy operates over nodes in the Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS) tree:
- Attack MDP: A state consists of the original malicious query, a group of LLM-generated candidate rewrites, their corresponding defense model responses, and judge scores (harm, responsibility, co-relevance). The action space comprises discrete prompt-rewriting strategies. Upon taking an action, new rewrites are generated, their responses are judged, and the tree is updated. The attack reward, conditioned on a co-relevance threshold, directly incentivizes maximal harmfulness among the group's candidates.
- Defense MDP: Each state is a single adversarial input (the most harmful rewrite, as assessed by the judge model). The defender outputs tokens sequentially via standard LLM autoregressive generation. The reward favors low harmfulness and high responsibility in responses.
This dual-MDP interlocking structure enables adversarial co-evolution during policy optimization (Li et al., 24 Nov 2025).
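The two reward structures above can be sketched as simple scoring functions. This is an illustrative reconstruction, not the paper's exact formulas: the co-relevance gate, score ranges, and the additive defense reward are assumptions.

```python
def attack_reward(harm_scores, co_relevance, threshold):
    """Group-level attack reward sketch: reward the most harmful rewrite,
    but only among candidates that pass the co-relevance gate.
    The gating and max-over-group form are assumptions, not the paper's formula."""
    eligible = [h for h, c in zip(harm_scores, co_relevance) if c >= threshold]
    return max(eligible) if eligible else 0.0

def defense_reward(harm, responsibility):
    """Defense reward sketch: low harmfulness plus high responsibility.
    Judge scores assumed in [0, 10]; the additive form is illustrative."""
    return (10.0 - harm) + responsibility
```

Gating on co-relevance prevents the attacker from scoring with off-topic rewrites that are harmful but unrelated to the original query.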
2. Adversarial Curriculum Reinforcement Learning
AC-TGPO implements a multi-round adversarial curriculum strategy; each round assembles three sample sets:
- Normal Set: Collected by running GS-MCTS with the current attack and defense policies, yielding typical challenge samples.
- Asymmetric Set: Generated by attacking the defense model from the previous curriculum round, mining samples on which the legacy defense is vulnerable.
- Hard Set: Derived by re-testing the newest defense on previously merged hard cases, harvesting samples that resist current mitigation.
These sets are merged per round, with early epochs in each round over-sampling hard and asymmetric scenarios to accelerate difficulty adaptation. Sample composition and hardness are therefore dynamically calibrated by adversarial interaction (Li et al., 24 Nov 2025).
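The per-round merge with early-epoch over-sampling can be sketched as follows. The `warmup_epochs` and `hard_boost` knobs are illustrative placeholders; the paper's actual sampling schedule is not reproduced here.

```python
import random

def build_round_dataset(normal, asymmetric, hard, epoch,
                        warmup_epochs=2, hard_boost=2.0, seed=0):
    """Merge the round's three sample sets; during early epochs,
    over-sample the asymmetric and hard sets to steepen the curriculum.
    'warmup_epochs' and 'hard_boost' are hypothetical knobs."""
    rng = random.Random(seed + epoch)
    data = list(normal) + list(asymmetric) + list(hard)
    if epoch < warmup_epochs:
        pool = list(asymmetric) + list(hard)
        extra = int((hard_boost - 1.0) * len(pool))
        data += [rng.choice(pool) for _ in range(extra)] if pool else []
    rng.shuffle(data)
    return data
```

Over-sampling only in early epochs lets the policy first adapt to the hardest cases, then rebalance toward the full distribution.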
3. Tree-Aware Group Policy Optimization
AC-TGPO employs a PPO-style policy optimizer with advanced normalization procedures at both the group and search-tree level:
- Group-level normalization: For each group of rollouts per training example, rewards are standardized within the group by subtracting the group mean and dividing by the group standard deviation.
- Tree-level normalization: Rollouts are also normalized across the MCTS search tree using depth-aware, discounted weights to reflect both local sample quality and the global search context.
- Policy update objective: The per-token PPO loss uses a clipped policy probability ratio, augmented by a KL penalty that regularizes the updated policy toward a reference policy; both the clip threshold and the KL coefficient are fixed hyperparameters.
Differentiation from vanilla PPO is achieved by the two-stage normalization and explicit integration of MCTS tree statistics. This enables variance stabilization and effective credit assignment in highly non-stationary, adversarial LLM training regimes (Li et al., 24 Nov 2025).
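A minimal sketch of the three pieces above, assuming standard GRPO-style group standardization, a simple geometric depth discount for tree weighting, and the textbook clipped PPO objective; the default `gamma`, `clip_eps`, and `beta` values are placeholders, not the paper's settings.

```python
import math

def group_normalize(rewards, eps=1e-8):
    # Standardize rewards within one rollout group (group-level baseline).
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def tree_weight(depth, gamma=0.9):
    # Depth-aware discount: deeper MCTS nodes contribute less credit.
    return gamma ** depth

def ppo_kl_loss(logp_new, logp_old, advantage, kl, clip_eps=0.2, beta=0.01):
    """Per-token clipped PPO objective plus KL penalty.
    clip_eps and beta are illustrative defaults, not the paper's values."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + beta * kl
```

In a full implementation, the group-normalized advantage would be multiplied by the node's tree weight before entering the PPO loss, so credit reflects both within-group quality and position in the search tree.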
4. Network Architecture and Parameterization
Both attacker and defender are instantiated as LLMs sharing identical backbone architectures (e.g., Vicuna-7B/13B, Llama3-8B, or Mistral-7B-0.3) and transformer stacks. There are no additional task-specific decoder heads; differentiation between attack and defense roles is controlled solely via input prompt templates. Curriculum stage cues are exclusively injected through data sampling, not model architecture. Judging for reward computation is performed by a frozen reference LLM (GPT-4, temperature 0) (Li et al., 24 Nov 2025).
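Because attack and defense roles are distinguished only by prompt templates on a shared backbone, role selection reduces to template dispatch. The template wording below is entirely hypothetical; only the mechanism (role via prompt, not architecture) comes from the source.

```python
# Hypothetical templates; the actual prompts are not given in the source.
ATTACK_TEMPLATE = (
    "You are a red-team assistant. Rewrite the query below using "
    "strategy '{strategy}' to probe the target model.\nQuery: {query}"
)
DEFENSE_TEMPLATE = (
    "You are a safe, responsible assistant. Answer helpfully while "
    "refusing harmful intent.\nUser: {query}"
)

def make_prompt(role, query, strategy=None):
    # Same backbone LLM for both roles; the role is set purely by the template.
    if role == "attack":
        return ATTACK_TEMPLATE.format(strategy=strategy, query=query)
    return DEFENSE_TEMPLATE.format(query=query)
```

Keeping role conditioning in the prompt (rather than in extra decoder heads) means a single checkpoint format serves both policies.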
5. Optimization Procedures and Hyperparameter Choices
Training utilizes 8× NVIDIA H800 GPUs, PyTorch, and AdamW with a linear warmup to the peak learning rate. Key settings include a microbatch size of 1 sequence per GPU, together with fixed values for the rollout group size, curriculum length, GS-MCTS search steps per query, depth discount, exploration constant, and jailbreak threshold, as well as the PPO clip ratio and KL penalty coefficient. A moderate generation temperature improves diversity, while the judge model decodes at temperature 0. Regularization is enforced through the KL loss constraint; gradient norms are clipped to 1.0. No explicit entropy regularization beyond the PPO clipped objective is included (Li et al., 24 Nov 2025).
6. Empirical Outcomes and Ablation Analysis
Empirical evaluation demonstrates substantial increases in both attack and defense robustness:
| Metric | ACE-Safety (Attack) | ACE-Safety (Defense) | Baseline |
|---|---|---|---|
| ASR-LR (↑ is worse for defense) | 95.2% (Vicuna-13B, 7.4 ANA) | 7.3% (Vicuna-7B, TAP) | varied; all less robust |
| Helpfulness | - | 5.4/10 (MT-Bench, AlpacaEval) | lower |
| Robustness (OST/SAT) | - | See Tables 2–4, Fig. 5–7 | lower |
| Responsibility | - | CValues-RP | lower |
Ablation studies reveal that freezing the attack model increases ASR by 3 points; removing GS-MCTS or prior context each increases ASR by 4 points; disabling tree-aware normalization adds 2 points; and removing asymmetric or hard samples increases ASR by 1.5 points. Each component of the AC-TGPO regime is therefore substantiated as critical for final system robustness (Li et al., 24 Nov 2025).
7. Context and Significance
AC-TGPO combines group-normalized, tree-aware PPO-based policy optimization, adversarial curriculum scheduling, and joint attack-defense co-training in adversarial LLM safety alignment. This configuration allows for continuous mutual advancement in both attack capability and defense robustness. In contrast to prior approaches that optimize only attackers or defenders in isolation, AC-TGPO operationalizes a co-evolutionary paradigm where the sample hardness and adversarial tactics escalate symmetrically with training progress. The resulting models set new benchmarks for LLM safety in jailbreak settings. A plausible implication is that group-level and tree-context-aware normalization can be generally beneficial in adversarial RL for other domains with non-stationary, self-escalating objectives (Li et al., 24 Nov 2025).