
Adversarial Gating Training (AGT)

Updated 9 February 2026
  • Adversarial Gating Training (AGT) is a machine unlearning paradigm that reformulates data erasure as a latent min–max game to protect against adversarial recovery while preserving model utility.
  • It employs a curriculum-based gating mechanism that triggers adversarial attacks only after model gradients stabilize, thus preventing premature instability and catastrophic forgetting.
  • AGT integrates Adaptive Orthogonality to resolve gradient conflicts between forgetting and retention losses, enhancing robustness against recovery via quantization or rapid re-learning.

Adversarial Gating Training (AGT) is a machine unlearning paradigm designed to achieve robust knowledge erasure while maintaining retention utility in LLMs and related deep neural networks. AGT reconfigures unlearning as a latent-space min–max game, incorporating adversarial attacks within the model’s internal representations and employing a curriculum-based gating mechanism to maximize erasure resistance without catastrophic forgetting. Most recently, AGT has been unified with Adaptive Orthogonality (AO) in the AGT$^{AO}$ framework to dynamically resolve geometric gradient conflicts, enabling stable unlearning even in regimes subject to adversarial recovery threats such as jailbreaks, quantization, or rapid re-learning (Li et al., 2 Feb 2026).

1. Motivation and Problem Definition

Current LLMs risk unintentional memorization or leakage of sensitive training data. Standard unlearning strategies face a fundamental dilemma:

  • Catastrophic Forgetting: Directly pushing model behavior away from a forget set $\mathcal{D}_f$ can indiscriminately degrade utility across all inputs, impairing performance on necessary retention data $\mathcal{D}_r$.
  • Superficial Forgetting: Conservative or superficial approaches may mask undesired knowledge without fully removing it, leaving models open to adversarial recovery via prompt engineering, latent-space perturbations, model quantization, or rapid re-learning.

Adversarial Gating Training is motivated by the need to jointly optimize for robust erasure—such that hidden activation perturbations, jailbreak prompts, or quantization-based attacks cannot resurrect erased information—while preserving downstream model utility.

2. AGT Methodology: Latent-Space Min–Max Game and Gating Mechanism

AGT formulates unlearning as a two-player latent min–max game:

  • The defender (outer minimizer) updates model parameters $\theta$ to erase information associated with the forget set $\mathcal{D}_f$ while preserving performance on the retain set $\mathcal{D}_r$.
  • The attacker (inner maximizer) searches for worst-case perturbations $\delta$ in the hidden state $h_f$ of forget-set data, maximizing the unlearning loss and attempting to recover “forgotten” knowledge.

A curriculum-based gating variable $G_t \in \{0,1\}$ controls invocation of the attacker. The initial training phase (warm-up, $G_t=0$) excludes adversarial perturbation. Subsequently, the attack phase ($G_t=1$) is triggered only once the $L_2$ gradient norm of the unlearning loss with respect to $\theta$ falls below a threshold $\tau_{\mathrm{grad}}$, preventing premature instability and oscillations.

3. Mathematical Formalism

Let $\mathcal{D}_f$ denote the forget set and $\mathcal{D}_r$ the retain set. For an input $x$ with label $y$, the hidden representation at layer $\ell$ is $f_\ell(x;\theta)$; we write $h_f$ for inputs drawn from $\mathcal{D}_f$ and $h_r$ for inputs drawn from $\mathcal{D}_r$.

The core losses are:

$$\mathcal{L}_{\mathrm{retain}}(h_r) = \mathbb{E}_{(x,y)\sim\mathcal{D}_r}\bigl[-\log p(y\mid h_r)\bigr],$$

$$\mathcal{L}_{\mathrm{forget}}(h_f) = -\frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_f}\log \sigma\!\left(-\frac{\beta}{|y|}\log p(y\mid h_f) - \alpha\right).$$
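A minimal PyTorch sketch of these two losses, assuming per-example sequence log-probabilities $\log p(y \mid h)$ (summed over tokens) and lengths $|y|$ are already available; the tensor shapes and the default values of beta and alpha are illustrative, not taken from the paper:

import torch
import torch.nn.functional as F

def retain_loss(logp_retain: torch.Tensor) -> torch.Tensor:
    # L_retain: negative log-likelihood on the retain set.
    # logp_retain: (B,) log p(y | h_r), one entry per retain example.
    return -logp_retain.mean()

def forget_loss(logp_forget: torch.Tensor, seq_lens: torch.Tensor,
                beta: float = 0.1, alpha: float = 0.0) -> torch.Tensor:
    # L_forget = -(2/beta) * E[ log sigma( -(beta/|y|) log p(y|h_f) - alpha ) ].
    # logp_forget: (B,) summed token log-probs; seq_lens: (B,) float lengths |y|.
    inner = -(beta / seq_lens) * logp_forget - alpha
    return -(2.0 / beta) * F.logsigmoid(inner).mean()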

The unified unlearning objective with AO regularization is:

$$\mathcal{L}_{\mathrm{unlearn}}(\theta,\,h_f,\,h_r) = \mathcal{L}_{\mathrm{forget}}(h_f) + \mathcal{L}_{\mathrm{retain}}(h_r) + \lambda_{\mathrm{ao}}\,\mathcal{R}_{\mathrm{AO}}(\theta).$$

The min–max formulation is:

$$\min_{\theta}\;\max_{\|\delta\|_p\le\epsilon} \mathcal{L}_{\mathrm{unlearn}}\!\left(\theta,\,h_f+\delta,\,h_r\right),$$

where the attacker optimizes $\delta$ (via projected gradient descent) to maximize erasure vulnerability, while the defender updates $\theta$ to minimize the unlearning loss.
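The inner maximization can be sketched as standard PGD on the hidden state. The sketch below assumes an $\ell_\infty$ budget (so the projection is element-wise clamping) and a caller-supplied closure that runs the layers above $\ell$ and returns $\mathcal{L}_{\mathrm{unlearn}}$; step sizes and budget are illustrative:

import torch

def pgd_attack(h_f, loss_from_hidden, eps=0.05, step=0.01, K=5):
    # K-step PGD attacker on the forget-set hidden state h_f.
    # loss_from_hidden(h) runs the remaining layers and returns L_unlearn.
    delta = torch.zeros_like(h_f, requires_grad=True)
    for _ in range(K):
        loss = loss_from_hidden(h_f + delta)      # attacker ascends the loss
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()           # signed-gradient ascent step
            delta.clamp_(-eps, eps)               # project onto the eps-ball
    return delta.detach()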

The gating mechanism is defined as

$$G_t = \begin{cases} 0, & t \le N_{\mathrm{warmup}}, \\ \mathbb{I}\!\left(\left\Vert\nabla_{\theta}\mathcal{L}_{\mathrm{unlearn}}\right\Vert_2 < \tau_{\mathrm{grad}}\right), & t > N_{\mathrm{warmup}}. \end{cases}$$

Inner maximization is performed only if $G_t = 1$.
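In code, the gate reduces to a warm-up check plus a global gradient-norm test; the values of $N_{\mathrm{warmup}}$ and $\tau_{\mathrm{grad}}$ below are illustrative hyperparameters:

import torch

def gate(t, grads, n_warmup=200, tau_grad=1.0):
    # G_t = 0 during warm-up; afterwards, 1 iff the global L2 norm of
    # grad_theta L_unlearn has fallen below tau_grad.
    if t <= n_warmup:
        return False
    sq_norm = sum(g.pow(2).sum() for g in grads)
    return bool(torch.sqrt(sq_norm) < tau_grad)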

4. Adaptive Orthogonality for Gradient Conflict Resolution

The AO (Adaptive Orthogonality) module penalizes destructive interference between the gradients of forgetting and retention losses. Given

$$g_f = \nabla_\theta \mathcal{L}_{\mathrm{forget}}, \qquad g_r = \nabla_\theta \mathcal{L}_{\mathrm{retain}},$$

the AO penalty is

$$\mathcal{R}_{\mathrm{AO}} = \mathbb{I}(g_f \cdot g_r < 0)\left(\frac{1 - \cos(g_f,\,g_r)}{2}\right)^{\gamma},$$

or equivalently,

$$\mathcal{L}_{\mathrm{AO}}(t) = \lambda_{\mathrm{ao}}(t)\,\bigl\Vert P_{g_r}^{\perp}(g_f)\bigr\Vert^2,$$

where $P_{g_r}^{\perp}$ projects $g_f$ onto the subspace orthogonal to $g_r$. This regularization is incorporated into the overall loss with dynamic weight $\lambda_{\mathrm{ao}}(t)$, adaptively mitigating geometric gradient conflict and stabilizing multi-objective optimization.
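Both forms are straightforward to compute from flattened gradient vectors; the following sketch assumes $g_f$ and $g_r$ are 1-D tensors and that the value of gamma is illustrative:

import torch
import torch.nn.functional as F

def ao_penalty(g_f, g_r, gamma=2.0):
    # Cosine form of R_AO: active only when the gradients conflict
    # (negative inner product between g_f and g_r).
    cos = F.cosine_similarity(g_f, g_r, dim=0)
    conflict = (torch.dot(g_f, g_r) < 0).float()
    return conflict * ((1.0 - cos) / 2.0) ** gamma

def orth_residual(g_f, g_r, eps=1e-12):
    # Projection form: squared norm of g_f's component orthogonal to g_r.
    proj = torch.dot(g_f, g_r) / torch.dot(g_r, g_r).clamp_min(eps) * g_r
    return torch.sum((g_f - proj) ** 2)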

5. Training Algorithm and Curriculum

AGT$^{AO}$ proceeds as follows:

  1. Warm-up Phase ($G_t=0$):
    • Iterate for $N_{\mathrm{warmup}}$ steps with no inner adversarial maximization.
    • Only perform standard unlearning updates (removal and retention losses, AO regularizer).
  2. Conditional Adversarial Phase ($G_t=1$):
    • When $\|\nabla_{\theta}\mathcal{L}_{\mathrm{unlearn}}\|_2 < \tau_{\mathrm{grad}}$, identify the worst-case $\delta$ via $K$-step PGD:

    $$\delta \leftarrow \Pi_{\|\cdot\|_p \le \epsilon}\bigl[\delta + \alpha \cdot \mathrm{sign}\bigl(\nabla_\delta \mathcal{L}_{\mathrm{unlearn}}(\theta,\,h_f+\delta,\,h_r)\bigr)\bigr]$$

    • Update $\theta$ on the total loss including AO.

Algorithmic outline:

for step t in 1..T:
    sample batches from D_f, D_r
    compute losses L_f, L_r and gradients g_f, g_r
    compute AO penalty R_AO
    G_t = 0 if t <= N_warmup else I(||∇_θ(L_f + L_r + λ_ao·R_AO)||_2 < τ_grad)
    if G_t == 1:                      # adversarial phase: K-step PGD on h_f
        δ = 0
        for k in 1..K:
            δ = Proj_{||δ||_p ≤ ε}[δ + α·sign(∇_δ L_unlearn(θ, h_f + δ, h_r))]
    else:                             # warm-up / pre-stabilization: no attack
        δ = 0
    L = L_forget(h_f + δ) + L_retain(h_r) + λ_ao(t)·R_AO
    θ = θ − η·∇_θ L
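A hedged PyTorch rendering of one such step, reusing the pgd_attack, gate, and ao_penalty sketches above; model.hidden (the layers up to $\ell$) and model.loss_from_hidden (the layers above) are assumed interfaces for this sketch, not the authors' implementation:

import torch

def agt_ao_step(t, model, batch_f, batch_r, optimizer,
                n_warmup=200, tau_grad=1.0, lam_ao=0.1):
    h_f = model.hidden(batch_f)            # h_f = f_l(x; theta), x ~ D_f
    h_r = model.hidden(batch_r)            # h_r = f_l(x; theta), x ~ D_r
    L_f = model.loss_from_hidden(h_f, batch_f, kind="forget")
    L_r = model.loss_from_hidden(h_r, batch_r, kind="retain")

    # Flattened gradients; create_graph=True keeps R_AO differentiable in
    # theta (double backprop), so it can act as a trainable penalty.
    params = [p for p in model.parameters() if p.requires_grad]
    g_f = torch.cat([g.flatten() for g in
                     torch.autograd.grad(L_f, params, create_graph=True)])
    g_r = torch.cat([g.flatten() for g in
                     torch.autograd.grad(L_r, params, create_graph=True)])
    R_ao = ao_penalty(g_f, g_r)

    # Gate on ||grad_theta(L_f + L_r)||_2; the AO term is omitted from the
    # norm here for simplicity.
    if gate(t, [(g_f + g_r).detach()], n_warmup, tau_grad):
        delta = pgd_attack(h_f.detach(), lambda h:
                           model.loss_from_hidden(h, batch_f, kind="forget"))
        L_f = model.loss_from_hidden(h_f + delta, batch_f, kind="forget")

    loss = L_f + L_r + lam_ao * R_ao
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()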

The curriculum initially restricts adversarial intervention, shifting to adversarial gating only as the retention–forgetting loss landscape stabilizes.

6. Empirical Evaluation and Comparison

AGT$^{AO}$ has been evaluated on several unlearning-critical tasks and benchmarks, including TOFU (fictional author biographies), MUSE (copyright removal), and WMDP (hazardous cybersecurity Q&A). Key evaluation metrics:

  • Unlearning Efficacy: Forget Quality, Knowledge Unlearning Ratio (KUR), ROUGE scores, WMDP accuracy.
  • Utility Preservation: Model Utility Score, Fluency, MMLU accuracy.
  • Privacy Leakage: Privacy Leakage Ratio (PLR), membership-inference AUC.

Summary of results (TOFU dataset, LLaMA-2-7B-chat, mean of three runs):

Method       Forget Quality ↑   KUR ↓   Model Utility ↑   Fluency ↑   PLR
AGT$^{AO}$   −9.43              0.01    0.59              0.90        0.53
LAT          −12.50             0.05    0.41              0.70        0.55
RMU          −14.20             0.14    0.45              0.76        0.59
NPO          −19.78             0.30    0.00              0.02        0.40

On WMDP-Cyber (Zephyr-7B-β model), AGT$^{AO}$ reduces hazardous Q&A accuracy from 44.0 (target) to 25.3, while retaining MMLU = 58.3. Ablation studies indicate that removing AO causes utility to drop (0.59 → 0.39), removing AGT degrades forget quality (−9.43 → −31.59), and eliminating gating increases KUR (0.01 → 0.60) (Li et al., 2 Feb 2026).

7. Discussion, Limitations, and Future Directions

Adversarial gating exposes the model to worst-case latent perturbations only once parameter evolution has stabilized, mitigating overfitting and catastrophic collapse. The formulated min–max game flattens the loss landscape in directions susceptible to data resurrection, increasing unlearning robustness against latent space and quantization attacks. The AO component promotes gradient orthogonality, thus reducing optimization interference between forgetting and retention objectives.

Primary limitations include the added computational burden from inner-loop adversarial optimization and open questions about scalability to extremely large models ($\gg$100B parameters). Future work may investigate alternative gating semantics (e.g., curvature-based triggers), dynamic budget selection for adversarial attacks, and lightweight approximations of inner maximization.

In summary, AGT$^{AO}$ operationalizes a rigorous, curriculum-driven, adversarially robust unlearning framework that delivers state-of-the-art knowledge erasure (KUR ≈ 0.01) while retaining high generalization (MMLU ≈ 58.3, Fluency ≈ 0.90) and resilience against both quantization- and re-learning-based recovery (Li et al., 2 Feb 2026).

References (1)
