Adversarial Gating Training (AGT)
- Adversarial Gating Training (AGT) is a machine unlearning paradigm that reformulates data erasure as a latent min–max game to protect against adversarial recovery while preserving model utility.
- It employs a curriculum-based gating mechanism that triggers adversarial attacks only after model gradients stabilize, thus preventing premature instability and catastrophic forgetting.
- AGT integrates Adaptive Orthogonality to resolve gradient conflicts between forgetting and retention losses, enhancing robustness against recovery via quantization or rapid re-learning.
Adversarial Gating Training (AGT) is a machine unlearning paradigm designed to achieve robust knowledge erasure while maintaining retention utility in LLMs and related deep neural networks. AGT reformulates unlearning as a latent-space min–max game, mounting adversarial attacks within the model’s internal representations and employing a curriculum-based gating mechanism to maximize erasure resistance without catastrophic forgetting. Most recently, AGT has been unified with Adaptive Orthogonality (AO) to dynamically resolve geometric gradient conflicts, enabling stable unlearning even in regimes subject to adversarial recovery threats such as jailbreaks, quantization, or rapid re-learning (Li et al., 2 Feb 2026).
1. Motivation and Problem Definition
Current LLMs risk unintentional memorization or leakage of sensitive training data. Standard unlearning strategies face a fundamental dilemma:
- Catastrophic Forgetting: Directly pushing model behavior away from the forget set D_f can indiscriminately degrade utility across all inputs, impairing performance on retained data that must be preserved.
- Superficial Forgetting: Conservative or superficial approaches may mask undesired knowledge without fully removing it, leaving models open to adversarial recovery via prompt engineering, latent-space perturbations, model quantization, or rapid re-learning.
Adversarial Gating Training is motivated by the need to jointly optimize for robust erasure—such that hidden activation perturbations, jailbreak prompts, or quantization-based attacks cannot resurrect erased information—while preserving downstream model utility.
2. AGT Methodology: Latent-Space Min–Max Game and Gating Mechanism
AGT formulates unlearning as a two-player latent min–max game:
- The defender (outer minimizer) updates the model parameters θ to erase information associated with the forget set D_f while preserving performance on the retain set D_r.
- The attacker (inner maximizer) searches for a worst-case perturbation δ of the hidden states h_f of forget-set data, maximizing the unlearning loss in an attempt to recover “forgotten” knowledge.
A curriculum-based gating variable G_t controls invocation of the attacker. The initial training phase (warm-up, G_t = 0) excludes adversarial perturbation. Subsequently, the attack phase (G_t = 1) is triggered only once the gradient norm of the unlearning loss with respect to the parameters θ falls below a threshold τ_grad, preventing premature instability and oscillations.
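As a minimal sketch, the gating schedule described above can be written as a simple function (the warm-up length and threshold values below are illustrative defaults, not the paper's):

```python
def gate(t, grad_norm, n_warmup=100, tau_grad=0.5):
    """Curriculum gate G_t: stay closed (0) during warm-up, then open (1)
    only once the unlearning-loss gradient norm falls below tau_grad."""
    if t <= n_warmup:
        return 0                      # warm-up phase: no adversarial perturbation
    return int(grad_norm < tau_grad)  # attack phase gated on gradient stability
```

Note that the gate can close again if the gradient norm rises back above the threshold, which is what prevents the attacker from amplifying an already-unstable optimization phase.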
3. Mathematical Formalism
Let D_f denote the forget set and D_r the retain set. For an input x with label y, the hidden representations at layer l are h_f (for forget-set inputs) and h_r (for retain-set inputs).
The core losses are a forgetting loss L_f, evaluated on D_f, and a retention loss L_r, evaluated on D_r, together with the AO regularizer R_AO.
The unified unlearning objective with AO regularization is

L_unlearn(θ) = L_f(θ; D_f) + L_r(θ; D_r) + λ_ao(t)·R_AO
The min–max formulation is

min_θ max_{||δ||_p ≤ ε} [ L_f(θ; h_f + δ) + L_r(θ; h_r) + λ_ao(t)·R_AO ],

where the attacker optimizes δ (via projected gradient descent) to maximize erasure vulnerability, while the defender updates θ to minimize the unlearning loss.
The gating mechanism is defined as

G_t = 0 for t ≤ N_warmup, and G_t = 1[ ||∇θ L_unlearn||_2 < τ_grad ] thereafter.

Inner maximization is performed only if G_t = 1.
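The inner maximization can be sketched as a standard sign-gradient PGD loop; the `grad_fn` callback and the quadratic toy loss in the usage line are illustrative stand-ins for autograd on the true unlearning loss:

```python
import numpy as np

def pgd_perturb(h_f, grad_fn, eps=0.1, alpha=0.02, steps=5):
    """K-step projected gradient ascent on a latent perturbation delta,
    projected back onto an l_inf ball of radius eps after each step.
    grad_fn(h) must return the gradient of the unlearning loss with
    respect to the (perturbed) hidden state."""
    delta = np.zeros_like(h_f)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_fn(h_f + delta))  # ascent step
        delta = np.clip(delta, -eps, eps)                      # projection
    return delta

# Toy usage: for the loss ||h||^2 the gradient is 2h, so each coordinate
# of the perturbation saturates at the l_inf budget in the sign of h.
delta = pgd_perturb(np.array([1.0, -1.0]), lambda h: 2 * h)
```

The sign-gradient update with an ℓ∞ projection is the usual PGD recipe; other norm balls (p in ||δ||_p ≤ ε) would swap in the corresponding projection operator.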
4. Adaptive Orthogonality for Gradient Conflict Resolution
The AO (Adaptive Orthogonality) module penalizes destructive interference between the gradients of the forgetting and retention losses. Given

g_f = ∇θ L_f and g_r = ∇θ L_r,

the AO penalty is

R_AO = ( ⟨g_f, g_r⟩ / (||g_f||_2 ||g_r||_2) )²,

or equivalently,

R_AO = 1 − ||P_⊥ g_f||²_2 / ||g_f||²_2,

where P_⊥ = I − g_r g_rᵀ / ||g_r||²_2 projects onto the subspace orthogonal to g_r. This regularization is incorporated into the overall loss with dynamic weight λ_ao(t), adaptively mitigating geometric gradient conflict and stabilizing multi-objective optimization.
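One plausible instantiation of the penalty and projection above (squared cosine similarity; function names are illustrative, not the paper's API):

```python
import numpy as np

def ao_penalty(g_f, g_r):
    """Squared cosine similarity between the forgetting and retention
    gradients: zero exactly when the two gradients are orthogonal."""
    cos = g_f @ g_r / (np.linalg.norm(g_f) * np.linalg.norm(g_r) + 1e-12)
    return cos ** 2

def project_orthogonal(g_f, g_r):
    """Component of g_f orthogonal to g_r:
    P_perp(g_f) = g_f - (<g_f, g_r> / ||g_r||^2) * g_r."""
    return g_f - (g_f @ g_r) / (g_r @ g_r + 1e-12) * g_r
```

For g_f = (1, 1) and g_r = (1, 0) the penalty is 0.5 and the projected forget gradient is (0, 1), which no longer interferes with the retention direction.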
5. Training Algorithm and Curriculum
AGT proceeds as follows:
- Warm-up Phase (G_t = 0):
  - Iterate for N_warmup steps with no inner adversarial maximization.
  - Perform only standard unlearning updates (removal and retention losses, AO regularizer).
- Conditional Adversarial Phase (G_t = 1):
  - When ||∇θ L_unlearn||_2 < τ_grad, identify the worst-case perturbation δ* via K-step PGD:
    δ ← Proj_{||δ||_p ≤ ε}[ δ + α·sign(∇_δ L_unlearn(θ, h_f + δ, h_r)) ]
  - Update θ on the total loss, including the AO term.
Algorithmic outline:
```
for step t in 1..T:
    sample batches from D_f, D_r
    compute losses L_f, L_r and gradients g_f, g_r
    compute AO penalty R_AO
    G_t = 0 if t <= N_warmup else 1[ ||∇θ(L_f + L_r + λ_ao·R_AO)||_2 < τ_grad ]
    if G_t == 1:
        δ = 0
        for k in 1..K:
            δ = Proj_{||δ||_p ≤ ε}[ δ + α·sign(∇_δ L_unlearn(θ, h_f + δ, h_r)) ]
    else:
        δ = 0
    L = L_forget(h_f + δ) + L_retain(h_r) + λ_ao(t)·R_AO
    θ = θ − η·∇θ L
```
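As a self-contained illustration, the outline above can be instantiated on a toy linear model, with finite-difference gradients standing in for autograd; all hyperparameters and the saturating (bounded) forget loss are illustrative choices, not the paper's:

```python
import numpy as np

def num_grad(f, v, h=1e-5):
    """Central finite-difference gradient -- a stand-in for autograd in this toy."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e.flat[i] = h
        g.flat[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
d = 3
x_f, y_f = rng.normal(size=(8, d)), rng.normal(size=8)  # forget set D_f
x_r, y_r = rng.normal(size=(8, d)), rng.normal(size=8)  # retain set D_r
theta = rng.normal(size=d)
theta0 = theta.copy()

# Saturating forget loss: minimized by driving forget-set error up, but bounded
# in (0, 1], so it cannot blow up the parameters (an illustrative choice).
L_f = lambda th, xf: np.exp(-np.mean((xf @ th - y_f) ** 2))
L_r = lambda th: np.mean((x_r @ th - y_r) ** 2)

T, n_warmup, tau_grad = 300, 50, 5.0
eps, alpha, K = 0.2, 0.05, 5           # PGD budget, step size, steps
lam_ao, eta = 0.1, 0.05                # AO weight, learning rate

for t in range(1, T + 1):
    g_f = num_grad(lambda th: L_f(th, x_f), theta)
    g_r = num_grad(L_r, theta)
    cos = g_f @ g_r / (np.linalg.norm(g_f) * np.linalg.norm(g_r) + 1e-12)
    r_ao = cos ** 2                    # AO penalty (held constant w.r.t. theta here)
    total = lambda th: L_f(th, x_f) + L_r(th) + lam_ao * r_ao
    gate = 0 if t <= n_warmup else int(np.linalg.norm(num_grad(total, theta)) < tau_grad)
    delta = np.zeros_like(x_f)
    if gate:                           # inner maximizer: PGD on forget-set inputs
        for _ in range(K):             # ascend L_f, i.e. try to recover accuracy on D_f
            gd = num_grad(lambda dv: L_f(theta, x_f + dv.reshape(x_f.shape)),
                          delta.ravel()).reshape(x_f.shape)
            delta = np.clip(delta + alpha * np.sign(gd), -eps, eps)
    theta -= eta * num_grad(lambda th: L_f(th, x_f + delta) + L_r(th) + lam_ao * r_ao,
                            theta)

print("retain MSE:", L_r(theta), "forget MSE:", np.mean((x_f @ theta - y_f) ** 2))
```

Treating the AO penalty as a constant within each parameter update is a simplification to keep the finite-difference toy cheap; with autograd one would differentiate through R_AO as well.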
The curriculum initially restricts adversarial intervention, shifting to adversarial gating only as the retention–forgetting loss landscape stabilizes.
6. Empirical Evaluation and Comparison
AGT has been evaluated on several unlearning-critical tasks and benchmarks, including TOFU (fictional author biographies), MUSE (copyright removal), and WMDP (hazardous cybersecurity Q&A). Key evaluation metrics:
- Unlearning Efficacy: Forget Quality, Knowledge Unlearning Ratio (KUR), ROUGE scores, WMDP accuracy.
- Utility Preservation: Model Utility Score, Fluency, MMLU accuracy.
- Privacy Leakage: Privacy Leakage Ratio (PLR), membership-inference AUC.
Summary of results (TOFU dataset, LLaMA-2-7B-chat, mean of three runs):
| Method | Forget Quality ↑ | KUR ↓ | Model Utility ↑ | Fluency ↑ | PLR |
|---|---|---|---|---|---|
| AGT | −9.43 | 0.01 | 0.59 | 0.90 | 0.53 |
| LAT | −12.50 | 0.05 | 0.41 | 0.70 | 0.55 |
| RMU | −14.20 | 0.14 | 0.45 | 0.76 | 0.59 |
| NPO | −19.78 | 0.30 | 0.00 | 0.02 | 0.40 |
On WMDP-Cyber (Zephyr-7B-β), AGT reduces hazardous Q&A accuracy from 44.0 (the target model) to 25.3, while retaining MMLU accuracy of 58.3. Ablation studies indicate that removing AO drops utility (0.59→0.39), removing the adversarial min–max component degrades forget quality (−9.43→−31.59), and eliminating gating increases KUR (0.01→0.60) (Li et al., 2 Feb 2026).
7. Discussion, Limitations, and Future Directions
Adversarial gating exposes the model to worst-case latent perturbations only once parameter evolution has stabilized, mitigating overfitting and catastrophic collapse. The formulated min–max game flattens the loss landscape in directions susceptible to data resurrection, increasing unlearning robustness against latent space and quantization attacks. The AO component promotes gradient orthogonality, thus reducing optimization interference between forgetting and retention objectives.
Primary limitations include the added computational burden of inner-loop adversarial optimization and open questions about scalability to extremely large models (≥100B parameters). Future work may investigate alternative gating criteria (e.g., curvature-based triggers), dynamic budget selection for adversarial attacks, and lightweight approximations of the inner maximization.
In summary, AGT operationalizes a rigorous, curriculum-driven, adversarially robust unlearning framework that delivers state-of-the-art knowledge erasure (KUR ≈ 0.01) while retaining high generalization (MMLU ≈ 58.3, Fluency ≈ 0.90) and resilience against both quantization- and re-learning-based recovery (Li et al., 2 Feb 2026).