Temporal Lipschitz-Guided Attacks

Updated 8 February 2026

Temporal Lipschitz-Guided Attacks (TLGA) are a class of white-box attacks that leverage local Lipschitz constants to pinpoint sensitive directions in video MoE architectures.
The approach dynamically adjusts perturbations across video frames and distinguishes between router-only (TLGA-R) and joint attacks (J-TLGA) for targeted disruption.
Empirical evaluations show TLGA drastically reduces robust accuracy, underscoring the need for advanced defenses like Joint Temporal Lipschitz Adversarial Training (J-TLAT).

Temporal Lipschitz-Guided Attacks (TLGA) are a class of white-box adversarial attacks specifically formulated to target the vulnerabilities of video Mixture-of-Experts (MoE) architectures. Unlike conventional attack paradigms that treat MoE models as monolithic entities, TLGA directly exploits the dynamic, sparse-routing behavior characteristic of these models, leveraging the local Lipschitz constant to identify and perturb the steepest, most sensitive directions in the input space. The method adapts the attack strength temporally by modulating perturbations across the video’s frames, systematically exposing both independent and collaborative weaknesses in the modular structure of video MoE systems (Wang et al., 1 Feb 2026).

1. Threat Model and Lipschitz-Guided Exploitation

The TLGA framework operates under a standard white-box threat model, wherein the adversary possesses detailed knowledge of the MoE’s parameters, architectural details—including routers and expert modules—and access to the model’s gradients. The attack goal is to craft an $\ell_p$ -bounded perturbation $\delta$ ( $\|\delta\|_p \leq \epsilon$ ) that induces a misclassification. In the video MoE context, each input video $x \in \mathbb{R}^{C \times T \times H \times W}$ passes through a lightweight router $R(x)$ , which selects a small, dynamic subset from a pool of $M$ expert modules $\{E_1, \ldots, E_M\}$ , yielding logits via $F(x) = \sum_{i=1}^M w_i(x) E_i(x)$ with routing weights $w_i(x) = R_i(x)$ . The temporal aspect ( $T$ frames) of video inputs introduces additional attack surface: perturbations can accumulate or be adapted frame-wise for maximal effect.

Central to TLGA is the incorporation of a local Lipschitz constant as an attack guidance mechanism. For any (sub)network $g(\cdot)$ , the local Lipschitz bound is estimated via

$L_t(g, x, \delta) = \frac{\|g(x+\delta)-g(x)\|_2}{\|\delta\|_2}$

A large $L_t$ signifies high sensitivity, indicating directions where small input changes cause large output variations—these are preferentially targeted to maximize adversarial vulnerability (Wang et al., 1 Feb 2026).

2. Formalization of the Temporal Lipschitz-Guided Attack

The fundamental optimization underlying TLGA interpolates a cross-entropy (CE) loss with a surrogate for the local Lipschitz constant. For the full MoE model $F$ with ground-truth label $y$ , the core objective is:

$\max_{\delta} \quad \ell_{\text{CE}}(F(x+\delta), y) + \alpha \cdot \frac{\|F(x+\delta) - F(x)\|_2^2}{\|x+\delta - x\|_2^2} \quad \text{s.t. } \|\delta\|_p \leq \epsilon$

Here, the second term is a finite-difference estimate of the Lipschitz constant, using mean-squared error (MSE) in both the output and input spaces. Temporal adaptation is achieved by using frame-specific gradients and step sizes: The update at timestep $t$ is regulated with momentum ( $\beta$ ) and a dynamic step-size $a_t = \eta \cdot \log(1 + V_{t+1})$ , where $V_{t+1} = \beta V_t + \|\nabla_{x} \ell_{\text{total}(t)}\|_2$ . Each perturbation is then projected to enforce the $\ell_\infty$ (or other $\ell_p$ ) constraint:

$\delta_{t+1} = \text{Proj}_{\|\cdot\|_\infty \leq \epsilon}\left[\delta_t + a_t \cdot \text{sign}(\nabla_x \ell_{\text{total}(t)})\right]$

This selective, temporally-adaptive optimization directly targets the most sensitive frames and directions in the input video (Wang et al., 1 Feb 2026).

3. Router-Only TLGA Versus Joint TLGA (J-TLGA)

Two principal attack variants are considered:

TLGA-Router (TLGA-R): The perturbation is crafted to destabilize the router exclusively. Specifically, $g = R$ and the CE-loss references the routing decision $y_R$ made on clean input. Additionally, the attacker can “guide” the router towards its least reliable expert (determined by lowest clean-confidence, $y_R^*$ ), extending the objective:

$\max_{\delta} \ \ell_{\text{CE}}(R(x+\delta), y_R) + \alpha \frac{\|R(x+\delta) - R(x)\|_2^2}{\|\delta\|_2^2} - \gamma \ell_{\text{CE}}(R(x+\delta), y_R^*)$

This drives the router to collapse onto suboptimal pathways.

Joint TLGA (J-TLGA): Both the router and the expert modules are simultaneously attacked. Defining $\ell_{\text{MOE}}(\delta) = \ell_{\text{CE}}(F(x+\delta), y) + \alpha L_t(F, x, \delta)$ and $\ell_{R}(\delta)$ as above, the joint objective is:

$\max_{\delta} \ \ell_{\text{MOE}}(\delta) + \lambda \ell_{R}(\delta) \quad \text{s.t. } \|\delta\|_p \leq \epsilon$

$\lambda$ modulates the trade-off between overall destabilization versus targeted routing collapse. Empirically, J-TLGA is substantially more destructive than targeting individual components (Wang et al., 1 Feb 2026).

4. Empirical Evaluation and Comparative Robustness

Extensive tests were conducted on UCF-101 and HMDB-51 action recognition datasets, evaluating MoE variants built upon 3D ResNet-18, TSM, SlowFast, and R(2+1)D expert backbones. The router selected four experts per clip. TLGA performance was benchmarked against baseline attacks FGSM, PGD, AutoAttack, and TT.

Robust accuracy, defined as percentage of correct predictions under attack, was reported for $\epsilon \in \{8, 10, 12, 14\}/255$ . On 3D ResNet-18 MoE trained with standard adversarial training (AT), the following excerpted results are illustrative:

$\epsilon$	PGD-R	TLGA-R	J-TLGA
8/255	42.20%	18.13%	4.95%
14/255	39.34%	15.82%	2.54%

TLGA-R reduces robust accuracy by over 20 percentage points compared to PGD (from ~40% to ~16% at $\epsilon = 14/255$ ), while J-TLGA further collapses it to approximately 2.5%. Trends persist across different backbones and on HMDB-51, with J-TLGA driving robust accuracy below 5%. Additionally, empirical local Lipschitz estimates (“Lips-R”, “Lips-J”) increase by orders of magnitude under TLGA, confirming the targeted sensitivity. In black-box (transfer) regimes, adversarial samples from J-TLGA exhibit superior transferability to unseen MoE models versus those generated by FGSM or PGD (Wang et al., 1 Feb 2026).

5. Defense via Joint Temporal Lipschitz Adversarial Training (J-TLAT)

Standard adversarial training methods, which typically inject PGD examples network-wide, fail to address the disjoint vulnerabilities revealed by TLGA in routers and experts. Joint Temporal Lipschitz Adversarial Training (J-TLAT) hierarchically addresses this gap via the following three-step min-max scheme per training epoch:

Router Defense:

$\min_{\theta_R} \max_{\|\delta_R\| \leq \epsilon} \ell_{\text{CE}}(R_{\theta_R}(x+\delta_R), R_{\theta_R}(x)) + \alpha_R \frac{\|R_{\theta_R}(x+\delta_R) - R_{\theta_R}(x)\|_2^2}{\|\delta_R\|_2^2}$

Expert Defense: Identify the two weakest experts $I = \text{Top-2}(R_{\theta_R}(x+\delta_R))$ :

$\min_{\theta_E} \max_{\|\delta_i\| \leq \epsilon} \sum_{i \in I} \left[ \ell_{\text{CE}}(E_{\theta_E, i}(x+\delta_i), y) + \alpha_E \frac{\|E_{\theta_E,i}(x+\delta_i) - E_{\theta_E,i}(x)\|_2^2}{\|\delta_i\|_2^2} \right]$

Overall MoE Defense:

$\min_{\Theta} \max_{\|\delta\| \leq \epsilon} \ell_{\text{CE}}(F_{\Theta}(x+\delta), y) + \alpha_M \frac{\|F_{\Theta}(x+\delta) - F_{\Theta}(x)\|_2^2}{\|\delta\|_2^2}$

Here, $\theta_R$ , $\theta_E$ , and $\Theta$ denote router, expert, and full-model parameters, respectively.

Empirically, J-TLAT achieves considerably higher robust accuracy under strong joint attacks (e.g., ≈22% vs. ≤2.5% for J-TLGA at $\epsilon = 14/255$ ), outperforming all baselines by at least 20 percentage points on UCF-101. Additionally, J-TLAT retains the intrinsic computational efficiency of MoE (inference cost reduced by over 60% relative to dense models), and produces a router with an order-of-magnitude smaller empirical Lipschitz constant (≈0.8 compared to hundreds under naïve AT), explaining the improved resilience observed against TLGA (Wang et al., 1 Feb 2026).

6. Conclusion and Implications

Temporal Lipschitz-Guided Attacks reveal and exploit the modular weaknesses of video MoE architectures by maximizing a finite-difference local Lipschitz surrogate across time, systematically exposing vulnerabilities particularly in the router and via collaborative interactions with expert modules. Joint attacks (J-TLGA) are markedly more devastating than separate component attacks, substantiating the importance of coordinated defense strategies. The introduction of Joint Temporal Lipschitz Adversarial Training (J-TLAT) delivers hierarchical, component-wise robustness improvements, preserving computational efficiency while substantially mitigating adversarial susceptibility. This suggests that modularity and sparse routing in MoE architectures, while beneficial for efficiency and performance, necessitate sophisticated, component-aware adversarial defenses for robust deployment in real-world video understanding tasks (Wang et al., 1 Feb 2026).

Markdown Report Issue Upgrade to Chat

References (1)

Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Temporal Lipschitz-Guided Attacks (TLGA).