Adversarial Gating Training (AGT)

Updated 9 February 2026

Adversarial Gating Training (AGT) is a machine unlearning paradigm that reformulates data erasure as a latent min–max game to protect against adversarial recovery while preserving model utility.
It employs a curriculum-based gating mechanism that triggers adversarial attacks only after model gradients stabilize, thus preventing premature instability and catastrophic forgetting.
AGT integrates Adaptive Orthogonality to resolve gradient conflicts between forgetting and retention losses, enhancing robustness against recovery via quantization or rapid re-learning.

Adversarial Gating Training (AGT) is a machine unlearning paradigm designed to achieve robust knowledge erasure while maintaining retention utility in LLMs and related deep neural networks. AGT recasts unlearning as a latent-space min–max game: adversarial attacks are applied to the model’s internal representations, and a curriculum-based gating mechanism controls when they run, so that resistance to recovery is maximized without catastrophic forgetting. More recently, AGT has been combined with Adaptive Orthogonality (AO) in the AGT$^{AO}$ framework, which dynamically resolves geometric gradient conflicts and enables stable unlearning even under adversarial recovery threats such as jailbreaks, quantization, or rapid re-learning (Li et al., 2 Feb 2026).

1. Motivation and Problem Definition

Current LLMs risk unintentional memorization or leakage of sensitive training data. Standard unlearning strategies face a fundamental dilemma:

- Catastrophic Forgetting: Directly pushing model behavior away from a forget set $\mathcal{D}_f$ can indiscriminately degrade utility across all inputs, impairing performance on necessary retention data $\mathcal{D}_r$.
- Superficial Forgetting: Conservative or superficial approaches may mask undesired knowledge without fully removing it, leaving models open to adversarial recovery via prompt engineering, latent-space perturbations, model quantization, or rapid re-learning.

Adversarial Gating Training is motivated by the need to jointly optimize for robust erasure—such that hidden activation perturbations, jailbreak prompts, or quantization-based attacks cannot resurrect erased information—while preserving downstream model utility.

2. AGT Methodology: Latent-Space Min–Max Game and Gating Mechanism

AGT formulates unlearning as a two-player latent min–max game:

- The defender (outer minimizer) updates model parameters $\theta$ to erase information associated with the forget set $\mathcal{D}_f$ while preserving performance on the retain set $\mathcal{D}_r$.
- The attacker (inner maximizer) searches for worst-case perturbations $\delta$ in the hidden state $h_f$ of forget-set data, maximizing the unlearning loss and attempting to recover “forgotten” knowledge.

A curriculum-based gating variable $G_t \in \{0,1\}$ controls invocation of the attacker. The initial training phase (warm-up, $G_t=0$) excludes adversarial perturbation. Subsequently, the attack phase ($G_t=1$) is triggered only once the $\ell_2$ gradient norm of the unlearning loss with respect to $\theta$ falls below a threshold $\tau$, preventing premature instability and oscillations.

3. Mathematical Formalism

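Let $\mathcal{D}_f$ denote the forget set and $\mathcal{D}_r$ the retain set. For input $x$ with label $y$, the hidden representations at layer $l$ are $h_f$ (forget-set inputs) and $h_r$ (retain-set inputs).

In generic form, the core losses are a forgetting loss on $\mathcal{D}_f$ and a retention loss on $\mathcal{D}_r$, with per-example losses $\ell_f$ and $\ell_r$ as specified in Li et al. (2 Feb 2026):

$$\mathcal{L}_f(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_f}\big[\ell_f(\theta; x, y)\big]$$

$$\mathcal{L}_r(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_r}\big[\ell_r(\theta; x, y)\big]$$

The unified unlearning objective with AO regularization, evaluated at a latent perturbation $\delta$ of $h_f$ and with AO weight $\lambda$, is:

$$\mathcal{L}_{\text{total}}(\theta, \delta) = \mathcal{L}_f(\theta;\, h_f + \delta) + \mathcal{L}_r(\theta) + \lambda\, \mathcal{L}_{\text{AO}}(\theta)$$

The min–max formulation is:

$$\min_{\theta}\; \max_{\|\delta\| \le \epsilon}\; \mathcal{L}_{\text{total}}(\theta, \delta)$$

where the attacker optimizes $\delta$ within a budget $\epsilon$ (via projected gradient descent) to maximize erasure vulnerability, while the defender updates $\theta$ to minimize unlearning loss.

The gating mechanism is defined as

$$G_t = \begin{cases} 1, & t > T_w \ \text{and}\ \|\nabla_\theta \mathcal{L}_{\text{total}}\|_2 < \tau \\ 0, & \text{otherwise} \end{cases}$$

where $T_w$ is the number of warm-up steps and $\tau$ is the gradient-norm threshold. Inner maximization is only performed if $G_t = 1$.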
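The gating rule described above can be sketched as a small pure function. This is a minimal illustration, assuming the gate opens only after a fixed number of warm-up steps and once the $\ell_2$ gradient norm drops below a threshold; the names `gate`, `warmup_steps`, and `tau` are hypothetical and not taken from the source.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat list of gradient components."""
    return math.sqrt(sum(g * g for g in grad))


def gate(step, grad, warmup_steps, tau):
    """Return 1 if the inner attacker may run at this step, else 0.

    Assumption: the gate requires BOTH that the warm-up has ended
    (step >= warmup_steps, with 0-indexed steps) and that the gradient
    norm is below the threshold tau.
    """
    if step < warmup_steps:
        return 0
    return 1 if l2_norm(grad) < tau else 0


# During warm-up the gate stays closed regardless of the gradient.
print(gate(2, [0.0, 0.0], warmup_steps=5, tau=0.1))
# After warm-up, a large gradient keeps it closed; a small one opens it.
print(gate(10, [3.0, 4.0], warmup_steps=5, tau=0.1))
print(gate(10, [0.03, 0.04], warmup_steps=5, tau=0.1))
```

Keeping the gate a function of observable training state (step count and gradient norm) makes the curriculum easy to log and to swap for other triggers.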

4. Adaptive Orthogonality for Gradient Conflict Resolution

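The AO (Adaptive Orthogonality) module penalizes destructive interference between the gradients of forgetting and retention losses. Given

$$g_f = \nabla_\theta \mathcal{L}_f, \qquad g_r = \nabla_\theta \mathcal{L}_r,$$

the AO penalty can be written as the squared overlap of the two gradients,

$$\mathcal{L}_{\text{AO}} = \frac{(g_f^\top g_r)^2}{\|g_r\|_2^2},$$

or equivalently,

$$\mathcal{L}_{\text{AO}} = \big\| g_f - P_{g_r}^{\perp} g_f \big\|_2^2, \qquad P_{g_r}^{\perp} = I - \frac{g_r g_r^\top}{\|g_r\|_2^2},$$

where $P_{g_r}^{\perp}$ projects $g_f$ onto the subspace orthogonal to $g_r$. This regularization is incorporated into the overall loss with dynamic weight $\lambda$, adaptively mitigating geometric gradient conflict and stabilizing multi-objective optimization.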
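The projection described above is straightforward to compute for flat gradient vectors. The sketch below assumes, for illustration only, that the penalty is the squared component of the forgetting gradient along the retention gradient; the exact penalty used by the authors may differ, and the function names are hypothetical. It also checks numerically that the overlap form and the projection form agree.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(g_f, g_r):
    """Project g_f onto the subspace orthogonal to g_r."""
    denom = dot(g_r, g_r)
    if denom == 0.0:
        return list(g_f)  # nothing to remove if g_r is zero
    scale = dot(g_f, g_r) / denom
    return [a - scale * b for a, b in zip(g_f, g_r)]


def ao_penalty(g_f, g_r):
    """Assumed penalty: squared overlap (g_f . g_r)^2 / ||g_r||^2."""
    denom = dot(g_r, g_r)
    if denom == 0.0:
        return 0.0
    return dot(g_f, g_r) ** 2 / denom


g_f = [1.0, -1.0]  # toy forgetting gradient
g_r = [0.0, 2.0]   # toy retention gradient (conflicts with g_f)

projected = project_orthogonal(g_f, g_r)
residual = [a - b for a, b in zip(g_f, projected)]

print(projected)                # component of g_f orthogonal to g_r
print(ao_penalty(g_f, g_r))     # overlap form
print(dot(residual, residual))  # projection form, same value
```

After projection the forgetting gradient has no component along the retention gradient, so a step along it leaves the retention loss unchanged to first order.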

5. Training Algorithm and Curriculum

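AGT$^{AO}$ proceeds as follows:

Warm-up Phase ($G_t=0$):
- Iterate for $T_w$ steps with no inner adversarial maximization.
- Only perform standard unlearning updates (removal and retention losses, AO regularizer).
Conditional Adversarial Phase ($G_t=1$):
- When the gradient norm of the unlearning loss falls below the threshold $\tau$, identify the worst-case $\delta$ via $K$-step PGD:
$$\delta^{(k+1)} = \Pi_{\|\delta\| \le \epsilon}\Big(\delta^{(k)} + \eta\, \nabla_\delta \mathcal{L}\big(\theta;\, h_f + \delta^{(k)}\big)\Big)$$
where $\Pi$ projects onto the perturbation budget $\|\delta\| \le \epsilon$ and $\eta$ is the attack step size.
- Update $\theta$ on total loss including AO.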
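The inner maximization can be sketched as projected gradient ascent on a latent perturbation. This is a toy illustration, not the authors' implementation: it assumes an $\ell_2$-ball budget, takes the loss gradient as a caller-supplied function (`grad_fn`, hypothetical), and uses a linear toy loss so the result is easy to verify by hand.

```python
import math


def project_l2_ball(delta, epsilon):
    """Scale delta back onto the L2 ball of radius epsilon if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon:
        return list(delta)
    return [d * epsilon / norm for d in delta]


def pgd_attack(h_f, grad_fn, epsilon, step_size, num_steps):
    """K-step projected gradient ascent on a perturbation of the hidden state.

    grad_fn(h) must return the gradient of the attacked loss with respect
    to the (perturbed) hidden state h.
    """
    delta = [0.0] * len(h_f)
    for _ in range(num_steps):
        perturbed = [h + d for h, d in zip(h_f, delta)]
        grad = grad_fn(perturbed)
        # Ascent step: the attacker increases the loss.
        delta = [d + step_size * g for d, g in zip(delta, grad)]
        delta = project_l2_ball(delta, epsilon)
    return delta


# Toy loss L(h) = w . h, whose gradient is the constant vector w.
w = [3.0, 4.0]
delta = pgd_attack(
    h_f=[0.5, -0.2],
    grad_fn=lambda h: w,
    epsilon=1.0,
    step_size=0.1,
    num_steps=5,
)
print(delta)  # points along w and sits on the budget boundary
```

For the linear toy loss the worst-case perturbation is the budget-sized vector along `w`, which is what the loop converges to after a few steps.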

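Algorithmic outline:

- For each training step $t$, compute the forgetting loss, the retention loss, and the AO penalty, and measure the gradient norm of the unlearning loss.
- Set $G_t=1$ if the warm-up has ended and the gradient norm is below the threshold; otherwise set $G_t=0$.
- If $G_t=1$, run the inner PGD maximization to obtain the worst-case latent perturbation $\delta$ of the forget-set hidden state $h_f$; if $G_t=0$, use $\delta=0$.
- Update $\theta$ by a gradient step on the total loss, including the AO term.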

The curriculum initially restricts adversarial intervention, shifting to adversarial gating only as the retention–forgetting loss landscape stabilizes.
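The curriculum can be illustrated end to end on a toy problem. The sketch below is an assumption-laden stand-in rather than the published algorithm: the "model" is a 2-D parameter vector that also serves as its own hidden state, both losses are quadratics with hand-written gradients, the AO term is omitted for brevity, and all names and hyperparameters are hypothetical.

```python
import math

TARGET_F = [0.0, 0.0]  # toy "erased" representation (assumption)
TARGET_R = [1.0, 0.0]  # toy optimum of the retention loss (assumption)


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def forget_grad(theta, delta):
    # Gradient of 0.5 * ||(theta + delta) - TARGET_F||^2; identical
    # with respect to theta and delta in this toy setup.
    return [t + d - f for t, d, f in zip(theta, delta, TARGET_F)]


def retain_grad(theta):
    # Gradient of 0.5 * ||theta - TARGET_R||^2.
    return [t - r for t, r in zip(theta, TARGET_R)]


def total_grad(theta, delta):
    return [a + b for a, b in zip(forget_grad(theta, delta), retain_grad(theta))]


def pgd(theta, epsilon, step_size, num_steps):
    """Inner maximization of the forgetting loss over an L2 ball."""
    delta = [0.0] * len(theta)
    for _ in range(num_steps):
        grad = forget_grad(theta, delta)
        delta = [d + step_size * g for d, g in zip(delta, grad)]
        n = norm(delta)
        if n > epsilon:
            delta = [d * epsilon / n for d in delta]
    return delta


def train(theta, steps=40, warmup_steps=5, tau=0.5, lr=0.1,
          epsilon=0.1, pgd_steps=3, pgd_lr=0.05):
    gates = []
    zero = [0.0] * len(theta)
    for t in range(steps):
        # The gate looks at the unperturbed gradient norm.
        clean = total_grad(theta, zero)
        g_t = 1 if (t >= warmup_steps and norm(clean) < tau) else 0
        # The attacker runs only when the gate is open.
        delta = pgd(theta, epsilon, pgd_lr, pgd_steps) if g_t else zero
        grad = total_grad(theta, delta)
        theta = [p - lr * g for p, g in zip(theta, grad)]
        gates.append(g_t)
    return theta, gates


theta, gates = train([2.0, 2.0])
print(gates)  # zeros while gradients are large, ones once they settle
print(theta)
```

In this toy run the gate stays closed well past the nominal warm-up because the gradient norm is still above the threshold, which is the behavior the curriculum is meant to produce: the attacker is admitted only after the defender's updates have settled.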

6. Empirical Evaluation and Comparison

AGT$^{AO}$ has been evaluated on several unlearning-critical tasks and benchmarks, including TOFU (fictional author biographies), MUSE (copyright removal), and WMDP (hazardous cybersecurity Q&A). Key evaluation metrics:

- Unlearning Efficacy: Forget Quality, Knowledge Unlearning Ratio (KUR), ROUGE scores, WMDP accuracy.
- Utility Preservation: Model Utility Score, Fluency, MMLU accuracy.
- Privacy Leakage: Privacy Leakage Ratio (PLR), membership-inference AUC.
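Of the metrics listed above, membership-inference AUC has a simple generic definition: the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The sketch below computes it by pairwise comparison; it is a generic illustration with made-up scores, not the evaluation protocol of the cited work, and `membership_auc` is a hypothetical name.

```python
def membership_auc(member_scores, nonmember_scores):
    """AUC of a membership-inference attack via pairwise comparison.

    Counts the fraction of (member, non-member) pairs in which the member
    has the higher score; ties count as half a win.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Perfectly separable scores: the attacker identifies every member.
print(membership_auc([0.9, 0.8], [0.1, 0.2]))
# Indistinguishable scores: the attacker does no better than chance.
print(membership_auc([0.1, 0.9], [0.5]))
```

An AUC close to 0.5 after unlearning indicates that the attack cannot tell forgotten examples from unseen ones, while values near 1.0 indicate residual leakage.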

Summary of results (TOFU dataset, LLaMA-2-7B-chat, mean of three runs):

Method	Forget Quality ↑	KUR ↓	Model Utility ↑	Fluency ↑	PLR
AGT$^{AO}$	‒9.43	0.01	0.59	0.90	0.53
LAT	‒12.50	0.05	0.41	0.70	0.55
RMU	‒14.20	0.14	0.45	0.76	0.59
NPO	‒19.78	0.30	0.00	0.02	0.40

On WMDP-Cyber (Zephyr-7B-β model), AGT$^{AO}$ reduces hazardous Q&A accuracy from 44.0 (target model) to 25.3, while retaining an MMLU accuracy of 58.3. Ablation studies indicate that removing AO lowers Model Utility (0.59→0.39), removing AGT degrades Forget Quality (‒9.43→‒31.59), and eliminating gating increases KUR (0.01→0.60) (Li et al., 2 Feb 2026).
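The size of each ablation effect follows directly from the reported pairs. The short sketch below only does that arithmetic on the numbers quoted above; the dictionary keys and the helper name `change` are illustrative.

```python
# (full method, ablated variant) pairs as reported in the text above.
ablations = {
    "Model Utility without AO": (0.59, 0.39),
    "Forget Quality without AGT": (-9.43, -31.59),
    "KUR without gating": (0.01, 0.60),
}


def change(full, ablated):
    """Absolute change caused by the ablation, rounded to two decimals."""
    return round(ablated - full, 2)


for name, (full, ablated) in ablations.items():
    print(f"{name}: {full} -> {ablated} (change {change(full, ablated):+.2f})")
```

Each component's removal moves its metric in the unfavorable direction: utility and forget quality fall, and KUR rises.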

7. Discussion, Limitations, and Future Directions

Adversarial gating exposes the model to worst-case latent perturbations only once parameter evolution has stabilized, mitigating overfitting and catastrophic collapse. The formulated min–max game flattens the loss landscape in directions susceptible to data resurrection, increasing unlearning robustness against latent space and quantization attacks. The AO component promotes gradient orthogonality, thus reducing optimization interference between forgetting and retention objectives.

Primary limitations include the added computational burden from inner-loop adversarial optimization and open questions of scalability to extremely large models ($>$100B parameters). Future work may investigate alternatives for gating semantics (e.g., curvature-based triggers), dynamic budget selection for adversarial attacks, and lightweight approximations of inner maximization.

In summary, AGT$^{AO}$ operationalizes a rigorous, curriculum-driven, adversarially robust unlearning framework that delivers state-of-the-art knowledge erasure (KUR ≈ 0.01) while retaining high generalization (MMLU ≈ 58.3, Fluency ≈ 0.90) and resilience against both quantization- and re-learning-based recovery (Li et al., 2 Feb 2026).


References

Li et al. (2026). AGT$^{AO}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality.
