
Temporal Lipschitz-Guided Attacks

Updated 8 February 2026
  • Temporal Lipschitz-Guided Attacks (TLGA) are a class of white-box attacks that leverage local Lipschitz constants to pinpoint sensitive directions in video MoE architectures.
  • The approach dynamically adjusts perturbations across video frames and distinguishes between router-only (TLGA-R) and joint attacks (J-TLGA) for targeted disruption.
  • Empirical evaluations show TLGA drastically reduces robust accuracy, underscoring the need for advanced defenses like Joint Temporal Lipschitz Adversarial Training (J-TLAT).

Temporal Lipschitz-Guided Attacks (TLGA) are a class of white-box adversarial attacks specifically formulated to target the vulnerabilities of video Mixture-of-Experts (MoE) architectures. Unlike conventional attack paradigms that treat MoE models as monolithic entities, TLGA directly exploits the dynamic, sparse-routing behavior characteristic of these models, leveraging the local Lipschitz constant to identify and perturb the steepest, most sensitive directions in the input space. The method adapts the attack strength temporally by modulating perturbations across the video’s frames, systematically exposing both independent and collaborative weaknesses in the modular structure of video MoE systems (Wang et al., 1 Feb 2026).

1. Threat Model and Lipschitz-Guided Exploitation

The TLGA framework operates under a standard white-box threat model, wherein the adversary possesses detailed knowledge of the MoE's parameters, architectural details (including routers and expert modules), and access to the model's gradients. The attack goal is to craft an $\ell_p$-bounded perturbation $\delta$ ($\|\delta\|_p \leq \epsilon$) that induces a misclassification. In the video MoE context, each input video $x \in \mathbb{R}^{C \times T \times H \times W}$ passes through a lightweight router $R(x)$, which selects a small, dynamic subset from a pool of $M$ expert modules $\{E_1, \ldots, E_M\}$, yielding logits via $F(x) = \sum_{i=1}^M w_i(x) E_i(x)$ with routing weights $w_i(x) = R_i(x)$. The temporal aspect ($T$ frames) of video inputs introduces an additional attack surface: perturbations can accumulate or be adapted frame-wise for maximal effect.

Central to TLGA is the incorporation of a local Lipschitz constant as an attack guidance mechanism. For any (sub)network $g(\cdot)$, the local Lipschitz bound is estimated via

$$L_t(g, x, \delta) = \frac{\|g(x+\delta) - g(x)\|_2}{\|\delta\|_2}$$

A large $L_t$ signifies high sensitivity, indicating directions where small input changes cause large output variations; these directions are preferentially targeted to maximize adversarial impact (Wang et al., 1 Feb 2026).
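The sparse-routing forward pass and the finite-difference Lipschitz estimate above can be sketched in a few lines of NumPy. Everything here (linear experts, a softmax router, the dimensions, `top_k = 2`) is an illustrative toy stand-in, not the paper's video architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: M linear "experts" and a softmax router over a flattened input.
M, D, K = 4, 16, 10          # experts, input dim, classes

experts = [rng.normal(size=(D, K)) for _ in range(M)]
W_router = rng.normal(size=(D, M))

def router(x):
    """Softmax routing weights w_i(x) = R_i(x)."""
    z = x @ W_router
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x, top_k=2):
    """F(x) = sum_i w_i(x) E_i(x), keeping only the top-k experts (sparse routing)."""
    w = router(x)
    top = np.argsort(w)[-top_k:]
    out = np.zeros(K)
    for i in top:
        out += w[i] * (x @ experts[i])
    return out

def local_lipschitz(g, x, delta):
    """Finite-difference estimate L_t(g, x, delta) = ||g(x+d) - g(x)||_2 / ||d||_2."""
    return np.linalg.norm(g(x + delta) - g(x)) / np.linalg.norm(delta)

x = rng.normal(size=D)
delta = 1e-3 * rng.normal(size=D)
print(local_lipschitz(moe_forward, x, delta))
```

An attacker probing many random directions `delta` and keeping the one with the largest estimate is the simplest instance of the "steepest direction" idea described above.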

2. Formalization of the Temporal Lipschitz-Guided Attack

The fundamental optimization underlying TLGA combines a cross-entropy (CE) loss with a surrogate for the local Lipschitz constant. For the full MoE model $F$ with ground-truth label $y$, the core objective is:

$$\max_{\delta} \quad \ell_{\text{CE}}(F(x+\delta), y) + \alpha \cdot \frac{\|F(x+\delta) - F(x)\|_2^2}{\|\delta\|_2^2} \quad \text{s.t. } \|\delta\|_p \leq \epsilon$$

Here, the second term is a finite-difference estimate of the Lipschitz constant, using squared $\ell_2$ norms in both the output and input spaces. Temporal adaptation is achieved through frame-specific gradients and step sizes: the update at timestep $t$ is regulated with momentum $\beta$ and a dynamic step size $a_t = \eta \cdot \log(1 + V_{t+1})$, where $V_{t+1} = \beta V_t + \|\nabla_{x} \ell_{\text{total}(t)}\|_2$. Each perturbation is then projected to enforce the $\ell_\infty$ (or other $\ell_p$) constraint:

$$\delta_{t+1} = \text{Proj}_{\|\cdot\|_\infty \leq \epsilon}\left[\delta_t + a_t \cdot \text{sign}(\nabla_x \ell_{\text{total}(t)})\right]$$

This selective, temporally-adaptive optimization directly targets the most sensitive frames and directions in the input video (Wang et al., 1 Feb 2026).
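The momentum-regulated step size and the $\ell_\infty$ projection above can be sketched as a single update function. This is a toy NumPy illustration driven by a stand-in random "gradient"; `tlga_step` and all constants are illustrative assumptions, not the paper's code:

```python
import numpy as np

def tlga_step(delta, grad_total, V, eps, eta=0.01, beta=0.9):
    """One temporally-adaptive TLGA update:
       V_{t+1}     = beta * V_t + ||grad||_2        (momentum accumulator)
       a_t         = eta * log(1 + V_{t+1})         (dynamic step size)
       delta_{t+1} = clip(delta + a_t * sign(grad), -eps, eps)  (l_inf projection)
    """
    V_next = beta * V + np.linalg.norm(grad_total)
    a_t = eta * np.log1p(V_next)
    delta = np.clip(delta + a_t * np.sign(grad_total), -eps, eps)
    return delta, V_next

rng = np.random.default_rng(1)
delta, V = np.zeros(16), 0.0
for _ in range(5):                 # five attack iterations on a fake gradient
    g = rng.normal(size=16)
    delta, V = tlga_step(delta, g, V, eps=8 / 255)
print(np.abs(delta).max())         # never exceeds the l_inf budget
```

In the full attack the same update would run per frame, with `grad_total` being the gradient of the CE-plus-Lipschitz-surrogate objective with respect to that frame.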

3. Router-Only TLGA Versus Joint TLGA (J-TLGA)

Two principal attack variants are considered:

TLGA-Router (TLGA-R): The perturbation is crafted to destabilize the router exclusively. Specifically, $g = R$ and the CE loss references the routing decision $y_R$ made on the clean input. Additionally, the attacker can "guide" the router towards its least reliable expert $y_R^*$ (determined by lowest clean confidence), extending the objective:

$$\max_{\delta} \ \ell_{\text{CE}}(R(x+\delta), y_R) + \alpha \frac{\|R(x+\delta) - R(x)\|_2^2}{\|\delta\|_2^2} - \gamma \, \ell_{\text{CE}}(R(x+\delta), y_R^*)$$

This drives the router to collapse onto suboptimal pathways.
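The three terms of the TLGA-R objective can be evaluated directly on toy routing distributions. All names and numbers below are illustrative, not from the paper's code:

```python
import numpy as np

def cross_entropy(probs, target_idx):
    return -np.log(probs[target_idx] + 1e-12)

def tlga_r_loss(r_adv, r_clean, delta, y_r, y_r_weak, alpha=1.0, gamma=1.0):
    """CE away from the clean routing decision y_r, plus the finite-difference
    Lipschitz surrogate, minus CE toward the weakest expert y_r_weak -- so
    maximizing this loss pushes the router onto the unreliable pathway."""
    lip = np.sum((r_adv - r_clean) ** 2) / np.sum(delta ** 2)
    return (cross_entropy(r_adv, y_r)
            + alpha * lip
            - gamma * cross_entropy(r_adv, y_r_weak))

r_clean = np.array([0.70, 0.20, 0.05, 0.05])   # clean routing weights
r_adv   = np.array([0.10, 0.15, 0.70, 0.05])   # routing weights under attack
delta   = 0.03 * np.ones(16)
y_r = int(np.argmax(r_clean))        # clean routing decision
y_r_weak = int(np.argmin(r_clean))   # least reliable expert (lowest clean confidence)
print(tlga_r_loss(r_adv, r_clean, delta, y_r, y_r_weak))
```

Here the adversarial distribution has already collapsed onto the weak expert, so the first two (maximized) terms dominate and the loss is large.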

Joint TLGA (J-TLGA): Both the router and the expert modules are attacked simultaneously. Defining $\ell_{\text{MOE}}(\delta) = \ell_{\text{CE}}(F(x+\delta), y) + \alpha L_t(F, x, \delta)$ and $\ell_{R}(\delta)$ as above, the joint objective is:

$$\max_{\delta} \ \ell_{\text{MOE}}(\delta) + \lambda \, \ell_{R}(\delta) \quad \text{s.t. } \|\delta\|_p \leq \epsilon$$

$\lambda$ modulates the trade-off between overall destabilization and targeted routing collapse. Empirically, J-TLGA is substantially more destructive than targeting individual components (Wang et al., 1 Feb 2026).
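The joint objective can be assembled on a self-contained toy model. The linear "model" $F$ and softmax "router" $R$ below are hypothetical stand-ins for the paper's video MoE components, and the targeted $-\gamma$ term of $\ell_R$ is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

D, K, M = 16, 10, 4
W_f = rng.normal(size=(D, K))   # stand-in for the full model F
W_r = rng.normal(size=(D, M))   # stand-in for the router R

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(probs, idx):
    return -np.log(probs[idx] + 1e-12)

def lip(out_adv, out_clean, delta):
    """Finite-difference Lipschitz surrogate (squared-norm ratio)."""
    return np.sum((out_adv - out_clean) ** 2) / np.sum(delta ** 2)

def j_tlga_loss(x, delta, y, y_r, alpha=1.0, lam=0.5):
    """l_MOE(delta) + lam * l_R(delta), both CE plus the Lipschitz surrogate."""
    F_c, F_a = softmax(x @ W_f), softmax((x + delta) @ W_f)
    R_c, R_a = softmax(x @ W_r), softmax((x + delta) @ W_r)
    l_moe = ce(F_a, y) + alpha * lip(F_a, F_c, delta)
    l_r = ce(R_a, y_r) + alpha * lip(R_a, R_c, delta)
    return l_moe + lam * l_r

x = rng.normal(size=D)
delta = 0.03 * rng.normal(size=D)
y = int(np.argmax(softmax(x @ W_f)))     # clean prediction
y_r = int(np.argmax(softmax(x @ W_r)))   # clean routing decision
print(j_tlga_loss(x, delta, y, y_r))
```

An attacker would ascend this scalar with the temporally-adaptive update described in Section 2.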

4. Empirical Evaluation and Comparative Robustness

Extensive tests were conducted on UCF-101 and HMDB-51 action recognition datasets, evaluating MoE variants built upon 3D ResNet-18, TSM, SlowFast, and R(2+1)D expert backbones. The router selected four experts per clip. TLGA performance was benchmarked against baseline attacks FGSM, PGD, AutoAttack, and TT.

Robust accuracy, defined as the percentage of correct predictions under attack, was reported for $\epsilon \in \{8, 10, 12, 14\}/255$. On a 3D ResNet-18 MoE trained with standard adversarial training (AT), the following excerpted results are illustrative:

| $\epsilon$ | PGD-R | TLGA-R | J-TLGA |
|---|---|---|---|
| 8/255 | 42.20% | 18.13% | 4.95% |
| 14/255 | 39.34% | 15.82% | 2.54% |

TLGA-R reduces robust accuracy by over 20 percentage points compared to PGD (from ~40% to ~16% at $\epsilon = 14/255$), while J-TLGA further collapses it to approximately 2.5%. Trends persist across different backbones and on HMDB-51, with J-TLGA driving robust accuracy below 5%. Additionally, empirical local Lipschitz estimates ("Lips-R", "Lips-J") increase by orders of magnitude under TLGA, confirming the targeted sensitivity. In black-box (transfer) regimes, adversarial samples from J-TLGA exhibit superior transferability to unseen MoE models versus those generated by FGSM or PGD (Wang et al., 1 Feb 2026).

5. Defense via Joint Temporal Lipschitz Adversarial Training (J-TLAT)

Standard adversarial training methods, which typically inject PGD examples network-wide, fail to address the disjoint vulnerabilities revealed by TLGA in routers and experts. Joint Temporal Lipschitz Adversarial Training (J-TLAT) hierarchically addresses this gap via the following three-step min-max scheme per training epoch:

  1. Router Defense:

$$\min_{\theta_R} \max_{\|\delta_R\| \leq \epsilon} \ell_{\text{CE}}(R_{\theta_R}(x+\delta_R), R_{\theta_R}(x)) + \alpha_R \frac{\|R_{\theta_R}(x+\delta_R) - R_{\theta_R}(x)\|_2^2}{\|\delta_R\|_2^2}$$

  2. Expert Defense: Identify the two weakest experts $I = \text{Top-2}(R_{\theta_R}(x+\delta_R))$:

$$\min_{\theta_E} \max_{\|\delta_i\| \leq \epsilon} \sum_{i \in I} \left[ \ell_{\text{CE}}(E_{\theta_E, i}(x+\delta_i), y) + \alpha_E \frac{\|E_{\theta_E,i}(x+\delta_i) - E_{\theta_E,i}(x)\|_2^2}{\|\delta_i\|_2^2} \right]$$

  3. Overall MoE Defense:

$$\min_{\Theta} \max_{\|\delta\| \leq \epsilon} \ell_{\text{CE}}(F_{\Theta}(x+\delta), y) + \alpha_M \frac{\|F_{\Theta}(x+\delta) - F_{\Theta}(x)\|_2^2}{\|\delta\|_2^2}$$

Here, $\theta_R$, $\theta_E$, and $\Theta$ denote the router, expert, and full-model parameters, respectively.
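The three-step hierarchy can be sketched as a runnable miniature on linear toy components. The shapes, the random-search inner maximization (a crude stand-in for the PGD-style inner loop), the dense mixture in step 3, and the omitted parameter update in step 3 are all simplifying assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

D, K, M = 8, 5, 4
W_r = 0.1 * rng.normal(size=(D, M))                        # router params (theta_R)
W_e = [0.1 * rng.normal(size=(D, K)) for _ in range(M)]    # expert params (theta_E)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_max(forward, x, eps, trials=20):
    """Crude inner maximization: keep the random l_inf-corner perturbation
    that moves the output the most (stand-in for the PGD-style inner loop)."""
    clean, best, best_gain = forward(x), None, -1.0
    for _ in range(trials):
        d = eps * np.sign(rng.normal(size=x.shape))
        gain = np.linalg.norm(forward(x + d) - clean)
        if gain > best_gain:
            best, best_gain = d, gain
    return best

x, y, eps, lr = rng.normal(size=D), 0, 8 / 255, 0.1

# Step 1: router defense. Gradient step on CE between the adversarial routing
# distribution and the (fixed) clean one; the Lipschitz term is omitted here.
d_r = inner_max(lambda v: softmax(v @ W_r), x, eps)
p_adv, p_clean = softmax((x + d_r) @ W_r), softmax(x @ W_r)
W_r -= lr * np.outer(x + d_r, p_adv - p_clean)

# Step 2: expert defense on the two experts most weighted under attack.
top2 = np.argsort(softmax((x + d_r) @ W_r))[-2:]
for i in top2:
    d_i = inner_max(lambda v: v @ W_e[i], x, eps)
    p = softmax((x + d_i) @ W_e[i])
    p[y] -= 1.0                               # standard softmax-CE gradient
    W_e[i] -= lr * np.outer(x + d_i, p)

# Step 3: craft the joint perturbation against the full (dense) mixture
# F(x) = sum_i w_i(x) E_i(x); the final parameter update is omitted.
def full(v):
    w = softmax(v @ W_r)
    return softmax(sum(w[j] * (v @ W_e[j]) for j in range(M)))

d = inner_max(full, x, eps)
print(full(x + d)[y])
```

The essential point the sketch captures is the ordering: harden the router first, then the experts the (attacked) router actually selects, then the composed model.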

Empirically, J-TLAT achieves considerably higher robust accuracy under strong joint attacks (e.g., ≈22% vs. ≤2.5% for J-TLGA at $\epsilon = 14/255$), outperforming all baselines by at least 20 percentage points on UCF-101. Additionally, J-TLAT retains the intrinsic computational efficiency of MoE (inference cost reduced by over 60% relative to dense models), and produces a router with an order-of-magnitude smaller empirical Lipschitz constant (≈0.8 compared to hundreds under naïve AT), explaining the improved resilience observed against TLGA (Wang et al., 1 Feb 2026).

6. Conclusion and Implications

Temporal Lipschitz-Guided Attacks reveal and exploit the modular weaknesses of video MoE architectures by maximizing a finite-difference local Lipschitz surrogate across time, systematically exposing vulnerabilities particularly in the router and via collaborative interactions with expert modules. Joint attacks (J-TLGA) are markedly more devastating than separate component attacks, substantiating the importance of coordinated defense strategies. The introduction of Joint Temporal Lipschitz Adversarial Training (J-TLAT) delivers hierarchical, component-wise robustness improvements, preserving computational efficiency while substantially mitigating adversarial susceptibility. This suggests that modularity and sparse routing in MoE architectures, while beneficial for efficiency and performance, necessitate sophisticated, component-aware adversarial defenses for robust deployment in real-world video understanding tasks (Wang et al., 1 Feb 2026).
