LipNeXt: Scalable 1-Lipschitz Network
- LipNeXt is a deep learning architecture that guarantees exact 1-Lipschitzness, providing deterministic robustness for high-dimensional image classification.
- It leverages constraint-free manifold optimization on the Stiefel manifold and convolution-free spatial mixing, enabling scalability to billions of parameters.
- Empirical results show state-of-the-art clean and certified robust accuracy on benchmarks like CIFAR and ImageNet, with stable and efficient bfloat16 training.
LipNeXt is a deep learning architecture designed for efficient, deterministic robustness certification via tight Lipschitz control, notably scaling to billion-parameter regimes while ensuring exact 1-Lipschitzness throughout. Unlike prior certified models constrained by computational overhead or numerical instability, LipNeXt employs constraint-free optimization on the orthogonal (Stiefel) manifold and convolution-free spatial mixing, sidestepping the limitations of earlier Lipschitz-based approaches. By combining orthogonal projections, parameter-free spatial shift modules, a tractable 1-Lipschitz nonlinearity (β-Abs), and spatial pooling, LipNeXt achieves state-of-the-art clean and certified robust accuracy (CRA) across standard benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet, with efficient and stable low-precision training (Hu et al., 26 Jan 2026).
1. Problem Statement and Scalability Challenges
LipNeXt is motivated by the need for deterministic (worst-case) $\ell_2$-certified robustness in high-dimensional image classification tasks. Certification demands a classifier $f$ that remains constant under all perturbations $\delta$ satisfying $\|\delta\|_2 \le \varepsilon$ for input $x$. If $f$ is $L$-Lipschitz, the certified radius is given by $r(x) = M_f(x)/(\sqrt{2}\,L)$, where $M_f(x)$ is the gap between the top two logits, so a tight $L$ is essential for nontrivial guarantees. Historically, Lipschitz-certified architectures have suffered from scaling limits due to heavy reparameterizations (SVD, FFT, Taylor expansions) or indirect regularizations, leading to computational bottlenecks, numerical fragility (especially at low precision), and small certified model sizes (≤32M parameters) with subpar ImageNet performance.
LipNeXt circumvents these barriers through two central innovations:
- Constraint-free manifold optimization directly on the Stiefel (orthogonal) manifold, utilizing efficient “FastExp” matrix exponential approximations and periodic stabilizations.
- Convolution-free spatial mixing using a norm-preserving Spatial Shift Module, proven to be the only isometric depthwise convolution with circular padding. This approach enables scalability to hundreds of millions or billions of parameters without sacrificing determinism or efficiency (Hu et al., 26 Jan 2026).
2. Tight Lipschitz Operations: Foundations
All constituent operations in LipNeXt are engineered to be exactly 1-Lipschitz, guaranteeing that their composition maintains this property.
Orthogonal Projections:
Linear maps are parameterized by orthogonal matrices $W$ (i.e., $W^\top W = I$), ensuring $\|Wx\|_2 = \|x\|_2$. These parameters are updated via Riemannian gradient descent on the manifold:
- Projected gradient: the Euclidean gradient $G = \nabla_W \mathcal{L}$ is projected onto the tangent space via its skew-symmetric part, $A = \tfrac{1}{2}\left(W^\top G - G^\top W\right)$.
- Retraction: $W \leftarrow W \exp(-\eta A)$; since $A$ is skew-symmetric, $\exp(-\eta A)$ is orthogonal, so $W$ stays exactly on the manifold.
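To make the update concrete, here is a minimal PyTorch sketch of one plain Riemannian gradient step; it omits the Adam moments and Lookahead of Section 4, and uses the exact `torch.matrix_exp` where the paper uses its FastExp approximation:

```python
import torch

def riemannian_step(X, euclid_grad, lr=1e-3):
    """One plain Riemannian gradient step on the orthogonal manifold:
    project the Euclidean gradient onto the tangent space via its
    skew-symmetric part, then retract with the matrix exponential."""
    A = X.T @ euclid_grad                    # pull the gradient back to the group
    skew = 0.5 * (A - A.T)                   # tangent-space projection
    return X @ torch.matrix_exp(-lr * skew)  # exp of a skew matrix is orthogonal

# X stays orthogonal to machine precision after the update:
X = torch.linalg.qr(torch.randn(64, 64)).Q
X = riemannian_step(X, torch.randn(64, 64))
print(torch.dist(X.T @ X, torch.eye(64)))    # ~1e-6
```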
Spatial Shift Module:
Channels are partitioned into five groups and shifted circularly (up, down, left, right, none), with no learnable parameters: for a channel $c$ in group $g$ with offset $(\delta_h^{(g)}, \delta_w^{(g)}) \in \{(-1,0), (1,0), (0,-1), (0,1), (0,0)\}$,
$$\mathrm{Shift}(x)_{c,h,w} = x_{c,\,(h + \delta_h^{(g)}) \bmod H,\,(w + \delta_w^{(g)}) \bmod W}.$$
Since this is a permutation of tensor entries, it preserves the $\ell_2$ norm exactly. Theorem 1 establishes that such spatial shifts are the only norm-preserving depthwise convolutions under circular padding.
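A minimal sketch of the shift module using `torch.roll`; the group ordering and shift directions are illustrative choices, and the final assertion checks the isometry:

```python
import torch

def spatial_shift(x):
    """Partition channels into five groups and circularly shift each group
    up/down/left/right/none. A pure permutation of tensor entries, hence
    exactly norm-preserving."""
    g = x.shape[1] // 5
    up, down, left, right, stay = x.split([g, g, g, g, x.shape[1] - 4 * g], dim=1)
    return torch.cat([
        torch.roll(up,    shifts=-1, dims=2),   # shift up along height
        torch.roll(down,  shifts=+1, dims=2),   # shift down
        torch.roll(left,  shifts=-1, dims=3),   # shift left along width
        torch.roll(right, shifts=+1, dims=3),   # shift right
        stay,                                   # unshifted group
    ], dim=1)

x = torch.randn(2, 10, 8, 8)
assert torch.allclose(spatial_shift(x).norm(), x.norm())  # exact isometry
```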
β-Abs Nonlinearity:
For $x \in \mathbb{R}^d$ and $\beta \in [0,1]$, β-Abs applies the absolute value to a fraction $\beta$ of the coordinates and the identity to the rest:
$$\sigma_\beta(x)_i = \begin{cases} |x_i|, & i \le \lceil \beta d \rceil, \\ x_i, & \text{otherwise.} \end{cases}$$
Any $\beta \in [0,1]$ yields a 1-Lipschitz activation. For $\beta = 1$, the activation is purely absolute value; for $\beta = 1/2$, it is equivalent (up to orthogonal change of basis) to MinMax.
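A sketch under the channel-fraction reading of β above (the name `beta_abs` and the ceiling split are illustrative):

```python
import math
import torch

def beta_abs(x, beta=0.5):
    """Absolute value on the first ceil(beta * C) channels, identity on the
    rest; both pieces are elementwise 1-Lipschitz, so the map is too."""
    k = math.ceil(beta * x.shape[1])
    return torch.cat([x[:, :k].abs(), x[:, k:]], dim=1)
```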
Spatial Pooling:
Reduces feature maps $x \in \mathbb{R}^{C \times H \times W}$ to a vector of per-channel Euclidean norms:
$$\mathrm{Pool}(x)_c = \Big( \sum_{h,w} x_{c,h,w}^2 \Big)^{1/2} = \|x_{c,:,:}\|_2.$$
By the channel-wise reverse triangle inequality, $\|\mathrm{Pool}(x) - \mathrm{Pool}(y)\|_2 \le \|x - y\|_2$, preserving 1-Lipschitzness.
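The pooling head reduces to a one-liner; a sketch:

```python
import torch

def spatial_pool(x):
    """Per-channel Euclidean norm over all spatial positions,
    (N, C, H, W) -> (N, C); the channel-wise reverse triangle
    inequality gives ||Pool(x) - Pool(y)|| <= ||x - y||."""
    return x.flatten(2).norm(dim=2)
```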
3. Network Architecture
LipNeXt employs a macrostructure analogous to ConvNeXt/MetaFormer but substitutes all mixing and nonlinearity blocks with strict 1-Lipschitz operations. A single LipNeXt block for input $x \in \mathbb{R}^{C \times H \times W}$ composes the following steps (a code sketch follows the list):
- Add learnable positional embedding $p$: $x \leftarrow x + p$ (a translation, hence an isometry).
- Spatial mixing: $x \leftarrow \mathrm{Shift}(x)$, the parameter-free shift module.
- Channel mixing via pointwise orthogonal linear + bias + activation: $x \leftarrow \sigma_\beta(Wx + b)$, with $W^\top W = I$.
Multiple such blocks are stacked (e.g., 32 or more) for substantial representational capacity. The classifier head uses spatial pooling, followed by a 1-Lipschitz linear classification layer.
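Putting the pieces together, a hypothetical single-block forward pass, reusing the `spatial_shift` and `beta_abs` sketches above; orthogonality of `W` is assumed to be maintained by the manifold optimizer of Section 4 rather than enforced here:

```python
import torch

class LipNeXtBlockSketch(torch.nn.Module):
    def __init__(self, channels, h, w, beta=0.5):
        super().__init__()
        self.pos = torch.nn.Parameter(torch.zeros(channels, h, w))  # positional embedding
        self.W = torch.nn.Parameter(torch.eye(channels))            # kept orthogonal by the optimizer
        self.b = torch.nn.Parameter(torch.zeros(channels))
        self.beta = beta

    def forward(self, x):                             # x: (N, C, H, W)
        x = x + self.pos                              # translation: an isometry
        x = spatial_shift(x)                          # parameter-free spatial mixing
        x = torch.einsum('oc,nchw->nohw', self.W, x)  # pointwise orthogonal linear
        x = x + self.b[None, :, None, None]           # bias: another translation
        return beta_abs(x, self.beta)                 # 1-Lipschitz activation
```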
4. Training Algorithm and Manifold Optimization
All orthogonal parameters are optimized directly on the Stiefel manifold using a stabilized Riemannian Adam variant, incorporating the “FastExp” routine for efficient matrix exponential calculations:
```
Algorithm 2: Stabilized Manifold Adam Optimizer with FastExp
Input: learning rate η ≈ 10⁻³, Adam coefficients β₁, β₂, Lookahead period K, epoch length N.
Init: orthogonal X, slow copy X_slow ← X, buffer B ← 0, moments m ← 0, v ← I/d.
for t = 1...T do
    G ← projected (skew-symmetric) gradient           # see Section 2
    m ← β₁ m + (1−β₁) G                               # first moment
    v ← β₂ v + (1−β₂) G⊙G                             # second moment (elementwise)
    Δ_t ← −η m / √v                                   # elementwise; stays skew-symmetric
    B ← B + Δ_t
    if (t mod K) ≠ 0:
        X ← X · FastExp(Δ_t)                          # fast inner step
    else:                                             # Lookahead synchronization
        X_slow ← X_slow · FastExp(B/2)
        X ← X_slow
        B ← 0
    if (t mod N) = 0:                                 # periodic polar retraction
        U, Σ, V^⊤ ← svd(X);  X ← U V^⊤;  X_slow ← X
endfor
```
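The paper's FastExp routine is not reproduced here; as an illustration of the two matrix primitives the loop needs, a truncated-Taylor stand-in for the exponential and the periodic polar retraction might look like:

```python
import torch

def fast_exp(A, order=4):
    """Illustrative stand-in for FastExp: truncated Taylor series of exp(A).
    For skew-symmetric A the result is only approximately orthogonal; the
    periodic polar retraction below absorbs the accumulated drift."""
    out = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device)
    term = out
    for k in range(1, order + 1):
        term = term @ A / k        # accumulates A^k / k!
        out = out + term
    return out

def polar_retract(X):
    """The `t mod N` stabilization step of Algorithm 2: snap X back to the
    nearest orthogonal matrix via the polar factor of its SVD."""
    U, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh
```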
5. Theoretical Guarantees
As every LipNeXt operation is exactly 1-Lipschitz and function composition preserves this bound, the entire network $f$ satisfies $\mathrm{Lip}(f) \le 1$. For the predicted class $t$, the top-logit ranking cannot change unless the margin
$$M_f(x) = f_t(x) - \max_{i \neq t} f_i(x)$$
crosses zero. Since the difference of two output coordinates of a 1-Lipschitz map is $\sqrt{2}$-Lipschitz, robustness is certifiable for all perturbations with $\|\delta\|_2 < M_f(x)/\sqrt{2}$; the certified radius is therefore at least $M_f(x)/\sqrt{2}$. This foundation enables deterministic, efficient certification for deep models (Hu et al., 26 Jan 2026).
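As a worked example, the certificate reduces to a two-line computation over the logits (a sketch; the batch handling is illustrative):

```python
import math
import torch

def certified_radius(logits):
    """l2 certified radius for a 1-Lipschitz network: the top-two logit
    margin divided by sqrt(2). No perturbation of smaller l2 norm can
    change the predicted class."""
    top2 = logits.topk(2, dim=-1).values
    return (top2[..., 0] - top2[..., 1]) / math.sqrt(2)

logits = torch.tensor([[3.2, 1.1, 0.4]])
print(certified_radius(logits))   # tensor([1.4849]) = (3.2 - 1.1) / sqrt(2)
```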
6. Empirical Performance
LipNeXt achieves state-of-the-art clean and certified robust accuracy (CRA) on CIFAR-10/100, Tiny-ImageNet, and ImageNet. Notably, LipNeXt models scale smoothly in both depth and width, reaching up to 2B parameters with non-saturating gains. Training remains stable in low precision (bfloat16), matching the throughput of standard non-certified models.
Selected results (all values in %):
| Dataset | Model | #Param. | Clean Acc. | CRA@36/255 | CRA@72/255 | CRA@108/255 |
|---|---|---|---|---|---|---|
| CIFAR-10 | LiResNet | 83M | 81.0 | 69.8 | 56.3 | 42.9 |
| CIFAR-10 | LipNeXt 32W1024 | 64M | 81.5 | 71.2 | 59.2 | 45.9 |
| CIFAR-10 | LipNeXt 32W2048 | 256M | 85.0 | 73.2 | 58.8 | 43.3 |
| CIFAR-100 | LiResNet | 83M | 53.0 | 40.2 | 28.3 | 19.2 |
| CIFAR-100 | LipNeXt 32W1024 | 64M | 53.3 | 41.3 | 30.5 | 21.8 |
| CIFAR-100 | LipNeXt 32W2048 | 256M | 57.4 | 44.1 | 31.9 | 22.2 |
| Tiny-ImageNet | LiResNet† | 83M | 40.9 | 26.2 | 15.7 | 8.9 |
| Tiny-ImageNet | LipNeXt 32W1024 | 64M | 42.5 | 32.0 | 21.8 | 15.2 |
| Tiny-ImageNet | LipNeXt 32W2048 | 256M | 45.5 | 35.0 | 25.9 | 18.0 |
On ImageNet (full 1000 classes), LipNeXt scales to 1-2B parameters: both LipNeXt L32W4096 (1B) and LipNeXt L32W5792 (2B) reach state-of-the-art clean accuracy and CRA at ε = 36/255 among deterministically certified models, with gains that do not saturate from the 1B to the 2B variant.
Training throughput matches non-certified networks (single node, 8×H100); low-precision (bfloat16) operation is stable. Comparative architectures, including LiResNet and BRONet, are outperformed in both certified accuracy and efficiency (Hu et al., 26 Jan 2026).
7. Engineering Considerations, Limitations, and Future Directions
LipNeXt’s manifold optimization eliminates heavy per-step costs (no SVD/FFT/Taylor expansion for each update), relying on simple matrix multiplications and efficient FastExp approximations. Numerical stability is maintained through periodic polar retractions. Stable bfloat16 training is achieved without the overflow or instability present in competitors such as LiResNet or BRONet.
Limitations include robust overfitting on smaller datasets (e.g., CIFAR) unless supplemented by synthetic data or explicit regularizers. The spatial shift design depends on circular padding and positional encoding; more expressive but still 1-Lipschitz token mixers may yield future gains. The certification radius is ultimately bounded by the logit margin alone; bridging the gap to empirical robustness remains open. Extensions to other norms (e.g., $\ell_\infty$) or domains (non-image data such as speech, graphs) are proposed avenues for further exploration (Hu et al., 26 Jan 2026).
Overall, LipNeXt establishes the feasibility of deterministic Lipschitz-based certification at billion-parameter scale, combining exact robustness guarantees, competitive clean and certified accuracy, stable training in reduced precision, and scalable engineering.