
Adversarial Gating Training (AGT)

Updated 9 February 2026
  • Adversarial Gating Training (AGT) is a machine unlearning paradigm that reformulates data erasure as a latent min–max game to protect against adversarial recovery while preserving model utility.
  • It employs a curriculum-based gating mechanism that triggers adversarial attacks only after model gradients stabilize, thus preventing premature instability and catastrophic forgetting.
  • AGT integrates Adaptive Orthogonality to resolve gradient conflicts between forgetting and retention losses, enhancing robustness against recovery via quantization or rapid re-learning.

Adversarial Gating Training (AGT) is a machine unlearning paradigm designed to achieve robust knowledge erasure while maintaining retention utility in large language models (LLMs) and related deep neural networks. AGT recasts unlearning as a latent-space min–max game: adversarial attacks are applied to the model's internal representations, and a curriculum-based gating mechanism controls when those attacks begin, so that erasure resists recovery without causing catastrophic forgetting. In its most recent form, AGT is combined with Adaptive Orthogonality (AO) in the AGT$^{AO}$ framework, which dynamically resolves geometric gradient conflicts and enables stable unlearning even under adversarial recovery threats such as jailbreaks, quantization, or rapid re-learning (Li et al., 2 Feb 2026).

1. Motivation and Problem Definition

Current LLMs risk unintentional memorization or leakage of sensitive training data. Standard unlearning strategies face a fundamental dilemma:

  • Catastrophic Forgetting: Directly pushing model behavior away from a forget set $\mathcal{D}_f$ can indiscriminately degrade utility across all inputs, impairing performance on necessary retention data $\mathcal{D}_r$.
  • Superficial Forgetting: Conservative or superficial approaches may mask undesired knowledge without fully removing it, leaving models open to adversarial recovery via prompt engineering, latent-space perturbations, model quantization, or rapid re-learning.

Adversarial Gating Training is motivated by the need to jointly optimize for robust erasure—such that hidden activation perturbations, jailbreak prompts, or quantization-based attacks cannot resurrect erased information—while preserving downstream model utility.

2. AGT Methodology: Latent-Space Min–Max Game and Gating Mechanism

AGT formulates unlearning as a two-player latent min–max game:

  • The defender (outer minimizer) updates model parameters $\theta$ to erase information associated with the forget set $\mathcal{D}_f$ while preserving performance on the retain set $\mathcal{D}_r$.
  • The attacker (inner maximizer) searches for worst-case perturbations $\delta$ in the hidden state $h_f$ of forget-set data, maximizing the unlearning loss and attempting to recover “forgotten” knowledge.

A curriculum-based gating variable $G_t \in \{0,1\}$ controls invocation of the attacker. The initial training phase (warm-up, $G_t=0$) excludes adversarial perturbation. Subsequently, the attack phase ($G_t=1$) is triggered only once the gradient norm of the unlearning loss with respect to the model parameters $\theta$ falls below a preset threshold, preventing premature instability and oscillations.

3. Mathematical Formalism

Let $\mathcal{D}_f$ denote the forget set and $\mathcal{D}_r$ the retain set. For an input $x$ with label $y$, the hidden representations at the attacked layer are $h_f$ for forget-set inputs and $h_r$ for retain-set inputs.

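The core losses are a forgetting loss $\mathcal{L}_f(\theta)$, computed on $\mathcal{D}_f$, and a retention loss $\mathcal{L}_r(\theta)$, computed on $\mathcal{D}_r$.

The unified unlearning objective combines these two losses with the AO regularizer, which enters with a dynamic weight.

The min–max formulation is, schematically:

$$\min_{\theta}\; \max_{\|\delta\| \le \epsilon}\; \mathcal{L}_{\text{unlearn}}(\theta;\, h_f + \delta)$$

where the attacker optimizes $\delta$ (via projected gradient descent) to maximize erasure vulnerability, while the defender updates $\theta$ to minimize the unlearning loss. Here $\mathcal{L}_{\text{unlearn}}$ denotes the unlearning loss evaluated on the perturbed hidden state and $\epsilon$ the perturbation budget.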
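The inner maximization can be illustrated with a small, self-contained sketch of projected gradient ascent on a hidden-state perturbation. It is only a toy under stated assumptions: the l2 ball of radius `eps`, the step size `alpha`, the helper names `project_l2` and `pgd_attack`, and the linear stand-in loss are illustrative choices, not details taken from the source.

```python
import math

def project_l2(delta, eps):
    """Project delta onto the l2 ball of radius eps (assumed constraint set)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]

def pgd_attack(h_f, loss_grad, eps=1.0, alpha=0.25, steps=10):
    """Search for a perturbation of the hidden state h_f that increases the loss.

    loss_grad(h) must return the gradient of the loss with respect to h.
    """
    delta = [0.0] * len(h_f)
    for _ in range(steps):
        perturbed = [h + d for h, d in zip(h_f, delta)]
        grad = loss_grad(perturbed)
        # Ascent step on the loss, then projection back onto the budget.
        stepped = [d + alpha * g for d, g in zip(delta, grad)]
        delta = project_l2(stepped, eps)
    return delta

# Toy stand-in loss L(h) = w . h, whose gradient is the constant vector w.
w = [3.0, 4.0]
delta = pgd_attack([0.0, 0.0], lambda h: w, eps=1.0)
print(delta)  # approximately [0.6, 0.8], i.e. eps * w / ||w||
```

For a linear loss the worst-case perturbation lies on the boundary of the budget in the direction of the gradient, which is what the loop converges to. In an actual model, `loss_grad` would come from automatic differentiation through the layers above the perturbed hidden state.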

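The gating mechanism is defined as

$$G_t = \begin{cases} 1, & \text{if } \|\nabla_{\theta}\mathcal{L}_{\text{unlearn}}(\theta_t)\| < \tau, \\ 0, & \text{otherwise,} \end{cases}$$

where $\tau$ is the stability threshold on the gradient norm of the unlearning loss, and $G_t = 0$ throughout warm-up.

Inner maximization is only performed if $G_t = 1$.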
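A minimal sketch of such a gate is given below, assuming an l2 gradient norm and an explicit warm-up step count. The function name `gate` and the parameters `warmup_steps` and `tau` are hypothetical, and the default values are arbitrary.

```python
import math

def gate(step, grad, warmup_steps=100, tau=0.5):
    """Return G_t: 1 once warm-up is over and the gradient norm is below tau, else 0."""
    if step < warmup_steps:
        return 0  # warm-up: the attacker is never invoked
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return 1 if grad_norm < tau else 0

print(gate(10, [0.01, 0.01]))   # 0: still in warm-up
print(gate(200, [3.0, 4.0]))    # 0: gradient norm 5.0 is above tau
print(gate(200, [0.1, 0.1]))    # 1: gradient has stabilized
```

Because the gate is re-evaluated at every step, the attacker is switched off again whenever the gradient norm rises back above the threshold.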

4. Adaptive Orthogonality for Gradient Conflict Resolution

The AO (Adaptive Orthogonality) module penalizes destructive interference between the gradients of the forgetting and retention losses. Given the gradients of the forgetting loss $\mathcal{L}_f$ and the retention loss $\mathcal{L}_r$,

$$g_f = \nabla_{\theta}\mathcal{L}_f, \qquad g_r = \nabla_{\theta}\mathcal{L}_r,$$

the AO penalty measures how strongly the two gradients conflict. It can equivalently be written in terms of an operator that projects one gradient onto the subspace orthogonal to the other. This regularization is incorporated into the overall loss with a dynamic weight, adaptively mitigating geometric gradient conflict and stabilizing multi-objective optimization.
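As a rough illustration of the geometry involved, the sketch below computes the cosine between two gradient vectors and removes from one the component that lies along the other. Using the squared cosine as a conflict penalty is an assumption made purely for illustration, not the source's definition of the AO term, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(g_f, g_r):
    """Cosine of the angle between the two gradients (0.0 if either is zero)."""
    denom = math.sqrt(dot(g_f, g_f)) * math.sqrt(dot(g_r, g_r))
    return dot(g_f, g_r) / denom if denom > 0 else 0.0

def project_orthogonal(g_f, g_r):
    """Component of g_f that is orthogonal to g_r."""
    rr = dot(g_r, g_r)
    if rr == 0:
        return list(g_f)
    scale = dot(g_f, g_r) / rr
    return [f - scale * r for f, r in zip(g_f, g_r)]

def conflict_penalty(g_f, g_r):
    """Illustrative penalty: squared cosine, zero exactly when the gradients are orthogonal."""
    return cosine(g_f, g_r) ** 2

g_f = [1.0, 0.0]   # toy forgetting gradient
g_r = [-1.0, 1.0]  # toy retention gradient, partly opposed to g_f
print(cosine(g_f, g_r))              # negative: the two objectives conflict
print(conflict_penalty(g_f, g_r))    # about 0.5
print(project_orthogonal(g_f, g_r))  # [0.5, 0.5], orthogonal to g_r
```

The squared cosine and the projection are two views of the same quantity: one minus the squared cosine equals the squared norm of the projected gradient divided by the squared norm of the original gradient.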

5. Training Algorithm and Curriculum

AGT$^{AO}$ proceeds as follows:

  1. Warm-up Phase ($G_t = 0$):
    • Iterate for a fixed number of warm-up steps with no inner adversarial maximization.
    • Only perform standard unlearning updates (removal and retention losses, AO regularizer).
  2. Conditional Adversarial Phase (after warm-up):
    • When $G_t = 1$, identify the worst-case $\delta$ via multi-step PGD: each step ascends the gradient of the unlearning loss with respect to $\delta$ and projects the result back onto the allowed perturbation set.
    • Update $\theta$ on the total loss, including the AO term.

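Algorithmic outline: at every step the gate $G_t$ is evaluated; if it is open, the inner PGD loop computes $\delta$ on the forget-set hidden state $h_f$; the defender then takes one update of $\theta$ on the total loss.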
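The control flow of the two phases can be sketched on a toy problem. Everything specific here is an assumption for illustration: the two-parameter "model", the quadratic forget and retain losses, the one-dimensional perturbation budget `eps`, and the names `train`, `warmup`, `tau`, and `lr`. The AO term is omitted for brevity, so this is not the authors' algorithm, only the gated warm-up-then-attack loop.

```python
import math

def train(steps=200, warmup=50, tau=0.05, lr=0.1, eps=0.1):
    """Toy gated adversarial unlearning loop on a two-parameter model.

    theta[0] drives the hidden state on the forget input, theta[1] on the
    retain input. Forget loss: 0.5 * (h_f + delta)^2; retain loss: 0.5 * (h_r - 1)^2.
    """
    theta = [1.0, 0.0]
    attacked_steps = 0
    for t in range(steps):
        h_f, h_r = theta[0], theta[1]
        clean_grad = [h_f, h_r - 1.0]  # gradient of both losses with delta = 0
        grad_norm = math.sqrt(sum(g * g for g in clean_grad))
        gate = 1 if (t >= warmup and grad_norm < tau) else 0

        delta = 0.0
        if gate:
            # In one dimension the worst-case perturbation in [-eps, eps]
            # is reached in a single step: push h_f further from zero.
            delta = eps if h_f >= 0 else -eps
            attacked_steps += 1

        # Defender update on forget + retain losses (AO term omitted here).
        grad = [h_f + delta, h_r - 1.0]
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta, attacked_steps

theta, attacked_steps = train()
print(theta, attacked_steps)
```

With these defaults the gate stays closed for the 50 warm-up steps and is open for the remaining 150, during which the forget-side parameter is held close to zero despite the perturbation while the retain-side parameter converges to its target.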

The curriculum initially withholds adversarial intervention and enables the attacker only once the retention–forgetting loss landscape has stabilized.

6. Empirical Evaluation and Comparison

AGT$^{AO}$ has been evaluated on several unlearning-critical tasks and benchmarks, including TOFU (fictional author biographies), MUSE (copyright removal), and WMDP (hazardous cybersecurity Q&A). Key evaluation metrics:

  • Unlearning Efficacy: Forget Quality, Knowledge Unlearning Ratio (KUR), ROUGE scores, WMDP accuracy.
  • Utility Preservation: Model Utility Score, Fluency, MMLU accuracy.
  • Privacy Leakage: Privacy Leakage Ratio (PLR), membership-inference AUC.

Summary of results (TOFU dataset, LLaMA-2-7B-chat, mean of three runs):

| Method | Forget Quality ↑ | KUR ↓ | Model Utility ↑ | Fluency ↑ | PLR |
|---|---|---|---|---|---|
| AGT$^{AO}$ | -9.43 | 0.01 | 0.59 | 0.90 | 0.53 |
| LAT | -12.50 | 0.05 | 0.41 | 0.70 | 0.55 |
| RMU | -14.20 | 0.14 | 0.45 | 0.76 | 0.59 |
| NPO | -19.78 | 0.30 | 0.00 | 0.02 | 0.40 |

On WMDP-Cyber (Zephyr-7B-β model), AGT$^{AO}$ reduces hazardous Q&A accuracy from 44.0 (target model) to 25.3, while retaining MMLU = 58.3. Ablation studies indicate that removing AO lowers model utility (0.59→0.39), removing AGT degrades forget quality (-9.43→-31.59), and eliminating gating increases KUR (0.01→0.60) (Li et al., 2 Feb 2026).

7. Discussion, Limitations, and Future Directions

Adversarial gating exposes the model to worst-case latent perturbations only once parameter evolution has stabilized, mitigating overfitting and catastrophic collapse. The formulated min–max game flattens the loss landscape in directions susceptible to data resurrection, increasing unlearning robustness against latent space and quantization attacks. The AO component promotes gradient orthogonality, thus reducing optimization interference between forgetting and retention objectives.

Primary limitations include the added computational burden from inner-loop adversarial optimization and open questions of scalability to extremely large models (>100B parameters). Future work may investigate alternatives for gating semantics (e.g., curvature-based triggers), dynamic budget selection for adversarial attacks, and lightweight approximations of inner maximization.

In summary, AGT$^{AO}$ is a curriculum-driven, adversarially robust unlearning framework that reports state-of-the-art knowledge erasure on TOFU (KUR ≈ 0.01, Fluency ≈ 0.90) while retaining general capability on WMDP-Cyber (MMLU ≈ 58.3) and resilience against both quantization- and re-learning-based recovery (Li et al., 2 Feb 2026).

References (1)
