
Deep Equilibrium Networks (DEQs)

Updated 5 February 2026
  • Deep Equilibrium Networks (DEQs) are implicit neural architectures that compute outputs as the fixed point of a nonlinear transform, enabling infinite-depth models with constant memory usage.
  • They use root-finding algorithms like Broyden’s method to solve for the equilibrium point and apply implicit differentiation for gradient computation without explicit layer unrolling.
  • DEQs have been successfully applied in language modeling, vision, inverse problems, and quantum computing, often matching or exceeding traditional deep architectures in performance and efficiency.

Deep Equilibrium Networks (DEQs) are a class of implicit neural architectures that define network outputs as fixed points of nonlinear transformations rather than by explicit, finite unrolling of layers. Unlike conventional deep models, which compose many parametrized layers, DEQs collapse this recurrence into a single transform whose equilibrium is found by a root-finding algorithm. This approach enables infinitely deep, weight-tied representations, makes memory use constant with respect to depth, and permits backpropagation via implicit differentiation. DEQs have been validated across domains including language modeling, vision, inverse problems, and quantum computing, often matching or exceeding the performance of explicit deep architectures at much reduced memory cost (Bai et al., 2019).

1. Mathematical Formulation and Foundations

A DEQ is specified by a parameterized operator $f_\theta$ acting on a hidden state $\mathbf{z} \in \mathbb{R}^d$ and input $\mathbf{x}$. Instead of the explicit multi-layer recursion

$$\mathbf{h}_0 = \mathbf{x}, \quad \mathbf{h}_{k+1} = f_\theta(\mathbf{h}_k; \mathbf{x}),$$

DEQs directly seek the equilibrium

$$\mathbf{z}^* = f_\theta(\mathbf{z}^*; \mathbf{x}),$$

i.e., a solution to the fixed-point equation. Root-finding methods (e.g., Broyden's method or Anderson acceleration) are used to approximate $\mathbf{z}^*$, enabling the implicit modeling of infinitely deep weight-tied compositions.
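The forward pass can be sketched with plain fixed-point iteration on a toy equilibrium layer (a hypothetical `tanh` transform with spectrally rescaled weights so the map is a contraction; real DEQs replace the naive loop with Broyden or Anderson solvers, but the target $\mathbf{z}^*$ is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy equilibrium layer: f_theta(z; x) = tanh(W z + U x + b).
# W is rescaled so that f is a contraction (spectral norm < 1),
# which guarantees a unique fixed point z* = f(z*; x).
d, p = 8, 4
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)          # enforce ||W||_2 = 0.5 < 1
U = rng.standard_normal((d, p))
b = rng.standard_normal(d)
x = rng.standard_normal(p)

def f(z):
    return np.tanh(W @ z + U @ x + b)

# Plain fixed-point iteration z <- f(z); converges geometrically
# for a contraction.
z = np.zeros(d)
for _ in range(100):
    z_next = f(z)
    if np.linalg.norm(z_next - z) < 1e-10:
        z = z_next
        break
    z = z_next

residual = np.linalg.norm(f(z) - z)      # ~0 at equilibrium
```

All names and scales here are illustrative; the point is only that the "output" of the layer is defined by the solver's answer, not by a fixed number of applications of `f`.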

During training, gradients with respect to parameters are computed by differentiating through the equilibrium via the Implicit Function Theorem:

$$\frac{\partial \ell}{\partial \theta} = -\frac{\partial \ell}{\partial \mathbf{z}^*} \cdot \left( \frac{\partial f_\theta}{\partial \mathbf{z}} \bigg|_{\mathbf{z}^*} - I \right)^{-1} \frac{\partial f_\theta}{\partial \theta} \bigg|_{\mathbf{z}^*}$$

This decouples the backward pass from the explicit trajectory of intermediate layer activations, resulting in $\mathcal{O}(1)$ memory usage with respect to effective depth (Bai et al., 2019).
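A minimal numerical check of this formula, assuming a toy `tanh` equilibrium layer and a quadratic loss (the layer, loss, and all scales are illustrative): the implicit gradient with respect to the bias matches a finite-difference gradient computed through the solver.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 8, 4
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)   # contraction -> unique z*, I - J invertible
U = rng.standard_normal((d, p))
b = rng.standard_normal(d)
x = rng.standard_normal(p)

def f(z, b):
    return np.tanh(W @ z + U @ x + b)

def solve_fp(b):
    # Forward solver: enough fixed-point iterations to converge.
    z = np.zeros(d)
    for _ in range(200):
        z = f(z, b)
    return z

z_star = solve_fp(b)

# Loss l(z*) = 0.5 ||z*||^2, so dl/dz* = z*.
dl_dz = z_star

# At z* = tanh(a), both df/dz = D W and df/db = D share
# D = diag(1 - z*^2), the derivative of tanh.
D = np.diag(1.0 - z_star**2)
J = D @ W                         # df/dz at the equilibrium
df_db = D                         # df/db at the equilibrium

# Implicit-function-theorem gradient:
#   dl/db = -dl/dz* (J - I)^{-1} df/db
grad_implicit = -dl_dz @ np.linalg.inv(J - np.eye(d)) @ df_db

# Central finite differences through the solver as a reference.
eps = 1e-6
grad_fd = np.zeros(d)
for i in range(d):
    bp, bm = b.copy(), b.copy()
    bp[i] += eps
    bm[i] -= eps
    grad_fd[i] = (0.5 * np.sum(solve_fp(bp)**2)
                  - 0.5 * np.sum(solve_fp(bm)**2)) / (2 * eps)

err = np.max(np.abs(grad_implicit - grad_fd))
```

Note that no intermediate iterate of the forward solver appears in the gradient: only the equilibrium $\mathbf{z}^*$ and local Jacobians are needed, which is the source of the constant-memory property.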

2. Solvers, Differentiation, and Stability

DEQs rely on efficient black-box solvers to find the equilibrium. Broyden's method is commonly used, iteratively approximating the inverse Jacobian of $g(\mathbf{z}) = f_\theta(\mathbf{z}; \mathbf{x}) - \mathbf{z}$ and achieving convergence in tens of steps for well-conditioned systems. Empirically, this is both faster and more stable than deep explicit unrolling with tied weights, which is often plagued by oscillations and divergence (Bai et al., 2019).
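A minimal sketch of the "good" Broyden update applied to $g(\mathbf{z}) = f_\theta(\mathbf{z}) - \mathbf{z}$, assuming a toy contraction layer; production DEQ solvers add safeguards such as line searches and low-rank storage of the inverse-Jacobian estimate rather than a dense matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)  # toy contraction layer
b = rng.standard_normal(d)

def f(z):
    return np.tanh(W @ z + b)

def g(z):                        # roots of g are the DEQ fixed points
    return f(z) - z

# "Good" Broyden: maintain B ~ J_g(z)^{-1} via rank-1 updates,
# avoiding explicit Jacobians entirely.
z = np.zeros(d)
B = -np.eye(d)                   # J_g ~ -I when f has a small Jacobian
gz = g(z)
for _ in range(50):
    dz = -B @ gz                 # quasi-Newton step
    z_new = z + dz
    g_new = g(z_new)
    dg = g_new - gz
    Bdg = B @ dg
    denom = dz @ Bdg
    if abs(denom) > 1e-12:
        # Sherman-Morrison-style secant update of the inverse Jacobian.
        B += np.outer(dz - Bdg, dz @ B) / denom
    z, gz = z_new, g_new
    if np.linalg.norm(gz) < 1e-10:
        break

residual = np.linalg.norm(g(z))
```

With `B = -I` and no updates the step reduces to plain fixed-point iteration `z <- f(z)`; the secant updates are what buy the faster, more stable convergence the text describes.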

The backward pass uses the same fixed-point machinery to solve the adjoint system for implicit gradients. Since iterative differentiation reuses the machinery of the forward pass, overall memory requirements remain fixed with respect to “depth.”
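The adjoint solve can be sketched the same way. Assuming a stand-in Jacobian $J$ with norm below one, the row-vector fixed point $\mathbf{u}^\top \leftarrow \partial\ell/\partial\mathbf{z}^* + \mathbf{u}^\top J$ recovers $\partial\ell/\partial\mathbf{z}^* \, (I - J)^{-1}$ using only vector-Jacobian products, the same primitive that reverse-mode autodiff provides:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
J = rng.standard_normal((d, d))
J *= 0.5 / np.linalg.norm(J, 2)   # stand-in for df/dz at z*, norm < 1
dl_dz = rng.standard_normal(d)    # stand-in for dl/dz*

# Backward pass needs u^T = dl/dz* (I - J)^{-1}. Rather than forming
# the inverse, iterate the linear fixed point u <- dl/dz* + u J,
# which converges geometrically when ||J|| < 1 and touches J only
# through vector-Jacobian products.
u = np.zeros(d)
for _ in range(200):
    u = dl_dz + u @ J

# Dense reference solution for comparison.
u_direct = dl_dz @ np.linalg.inv(np.eye(d) - J)
err = np.max(np.abs(u - u_direct))
```

Because this loop stores only the current vector `u`, the backward pass inherits the same depth-independent memory footprint as the forward solve.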

Theoretical challenges in DEQ training include:

  • Lack of general guarantees for equilibrium existence or uniqueness unless $f_\theta$ is a contraction mapping (i.e., $\|\partial f_\theta/\partial \mathbf{z}\| < 1$ everywhere).
  • Stability sensitivity to the statistics of weight initialization. Orthogonal or symmetric matrix initializations (e.g., Haar-distributed or GOE) significantly improve convergence and allow larger initial scales compared to i.i.d. Gaussian matrices. Empirical studies show that such normal matrix ensembles yield broader stable regimes and tighter performance distributions across seeds (Agarwala et al., 2022).
  • In practice, Jacobian regularization—penalizing $\|J_{f_\theta}(\mathbf{z}^*)\|$ at equilibrium—effectively controls instability, improves convergence, and can be efficiently estimated with trace estimators (Bai et al., 2021).
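The trace-estimator idea in the last bullet can be sketched as follows, with a stand-in dense Jacobian in place of the autodiff Jacobian-vector products a real DEQ would use:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
J = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for J_f at z*

# Hutchinson estimator of the squared Frobenius norm:
#   ||J||_F^2 = tr(J^T J) = E_v[ ||J v||^2 ]
# for random probes v with E[v v^T] = I. In a DEQ, J v is a
# Jacobian-vector product from autodiff, so the regularizer never
# materializes the full Jacobian.
n_samples = 20000
est = 0.0
for _ in range(n_samples):
    v = rng.choice([-1.0, 1.0], size=d)        # Rademacher probe
    est += np.sum((J @ v) ** 2)
est /= n_samples

exact = np.sum(J ** 2)                         # ground truth ||J||_F^2
rel_err = abs(est - exact) / exact
```

In training, a single probe per batch is typically enough, since the estimator's noise averages out across optimization steps.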

3. Model Variants and Theoretical Guarantees

Several DEQ variants have been proposed to address the foundational challenges in existence, uniqueness, and convergence:

  • Positive Concave DEQs (pcDEQ): By enforcing nonnegativity and concavity constraints on weights and activations, the operator becomes a monotone, standard interference function. This structure guarantees global existence and uniqueness of the fixed point and ensures geometric convergence using nonlinear Perron–Frobenius theory. The approach is empirically competitive and markedly more stable than general DEQs (Gabor et al., 2024).
  • Lipschitz-Controlled and Multiscale DEQs: Recent architectures explicitly tune the Lipschitz constant of $f_\theta$ via architectural design (e.g., through spectral normalization, scaling activations, or convex skip connections). As a result, the overall Lipschitz constant can be kept below 1, guaranteeing unique fixed points and linear convergence for both forward and backward passes. This yields substantial speed-ups (up to 4.75× on CIFAR-10) with only minor accuracy trade-offs (Sato et al., 3 Feb 2026).
  • Distributional DEQs (DDEQs): Extending the fixed-point paradigm to spaces of probability measures, DDEQs leverage Wasserstein gradient flows to find measure-valued equilibria. These models natively encode permutation invariance (sets/point clouds) and achieve competitive parameter efficiency and performance on tasks such as point cloud classification and completion (Geuter et al., 3 Mar 2025).
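The Lipschitz-control recipe in the second bullet can be sketched with spectral normalization via power iteration (the target constant `gamma` and the single-layer setup are illustrative; the cited architectures combine several such mechanisms):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_raw = rng.standard_normal((d, d)) * 2.0      # unconstrained weights

# Spectral normalization: rescale W so its spectral norm equals a
# fixed gamma < 1. With a 1-Lipschitz activation such as tanh, the
# layer z -> tanh(W z + b) is then a gamma-contraction, so the fixed
# point exists, is unique, and iteration converges linearly.
gamma = 0.9

def spectral_norm(W, n_iter=100):
    # Power iteration on W^T W to estimate the largest singular value.
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iter):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

W = W_raw * (gamma / spectral_norm(W_raw))
lip = np.linalg.norm(W, 2)                     # ~= gamma by construction
```

In practice one power-iteration step per training step suffices, since the weights change slowly and the singular-vector estimate can be carried over between updates.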

4. Application Domains and Empirical Results

DEQs have been instantiated in diverse domains:

  • Sequence Modeling: DEQ-Transformer and DEQ-TrellisNet directly solve for an equilibrium in sequence-length × hidden-dim space. On large-scale language modeling (e.g., WikiText-103), DEQ-Transformer achieves test perplexity of 24.2 (172M params), closely matching or improving over deep Transformer-XL baselines, while reducing training memory from 8.5GB to 2.7GB (≈68%) (Bai et al., 2019).
  • Vision: Multiscale DEQ (MDEQ) for image classification and segmentation achieves performance comparable to deep explicit ResNets and Transformers, using constant memory and showing strong OOD generalization (Geng et al., 2023).
  • Inverse Problems: DEQs have been developed for imaging problems via plug-and-play proximal operators, Bregman mirror descent (for Poisson noise), and self-supervised equivariant imaging losses (enabling training without ground truth via exploitation of group symmetries). These models attain competitive or superior PSNR and perceptual metrics compared to Bregman-PnP methods and U-Net baselines (Mehta et al., 24 Nov 2025, Daniele et al., 15 Jul 2025).
  • Quantum Learning: Quantum DEQ (QDEQ) architectures apply the fixed-point paradigm to parametrized quantum circuits, enabling learning with significantly reduced quantum circuit depth and parameters. QDEQ outperforms explicit quantum circuits with 5× more layers in several benchmarks (Schleich et al., 2024).
  • Kernel and Generalization Theory: In the infinite-width limit, DEQs exhibit a correspondence with Gaussian processes (NNGP), with strictly positive definite limiting kernels. Importantly, the order of the infinite-width and infinite-depth limits commutes, differentiating DEQs from standard deep MLPs. This underpins robust training and benign overfitting properties, with kernel ridge regression accurately modeling generalization. In high-dimensional regimes, explicit nets of moderate depth can replicate the kernel properties of a DEQ (Gao et al., 2023, Ling et al., 2024).

5. Training, Optimization, and Inference Acceleration

While DEQs offer depth-invariant memory, iterative root solving introduces computational overhead. Several algorithmic innovations have been developed to alleviate this:

  • Approximate gradient methods, such as GDEQ, reuse the inverse Jacobian approximations from forward Broyden iterations during backpropagation—substantially accelerating training with negligible degradation (Nguyen et al., 2023).
  • Consistency methods (C-DEQ) reframe inference through consistency distillation, learning a mapping that projects intermediate states directly to the equilibrium. This yields 2–20× speed-ups in low-NFE (number of function evaluations) regimes while preserving original accuracy (Lin et al., 3 Feb 2026).
  • Optimized libraries, such as TorchDEQ, unify solver and backward routines, support multiple solver types, and encapsulate normalization and regularization best practices. Across the “DEQ Zoo” (language, vision, graph, flow, INR, diffusion), such frameworks have systematically improved accuracy, training stability, and efficiency (Geng et al., 2023).

6. Robustness, Certification, and Regularization

  • Certified Robustness: Traditional interval bound propagation and Lipschitz-bounded certification methods have limited scalability on large-scale DEQs. Serialized Randomized Smoothing (SRS) adapts randomized smoothing for DEQs, reusing previous equilibrium solutions across correlated input samples. SRS achieves up to 7× speed-up in certification with minimal loss in accuracy compared to independent solves (Gao et al., 2024).
  • Adversarial Robustness: Standard adversarial training for DEQs under-regularizes intermediate states, as only terminal fixed points are constrained. Explicit regularization along the entire trajectory—via input entropy reduction and random-layer adversarial penalties—substantially improves DEQ robustness over strong explicit-deep baselines (Yang et al., 2023).
  • Regularization: Penalizing the Jacobian at equilibrium, especially by Frobenius norm via Hutchinson trace estimation, stabilizes both the forward fixed-point entries and the backward implicit-gradient computation. This approach delivers competitive training time and performance to explicit deep nets on ImageNet and WikiText-103 (Bai et al., 2021).

7. Extensions, Theoretical Insights, and Future Directions

  • DEQs provide a natural paradigm for joint inference and input optimization: by augmenting the fixed-point system to simultaneously encode network and input optimization conditions, one can embed inner optimization loops (latent variable inference, adversarial attack generation, meta-learning) within a single implicit layer, achieving 3–9× end-to-end speed-ups and O(1) memory (Gurumurthy et al., 2021).
  • Mirror descent-based DEQs extend the fixed-point methodology to non-Euclidean optimization geometries, including Poisson inverse problems with Kullback–Leibler fidelity. The fixed-point operator’s relative smoothness yields convergence guarantees via the Kurdyka–Łojasiewicz inequality (Daniele et al., 15 Jul 2025).
  • The kernel limit and random matrix theory indicate that—at least in the high-dimensional regime—DEQs are functionally equivalent to carefully designed shallow explicit networks regarding NTK and CK spectra, providing a theoretical link between implicit and explicit models (Ling et al., 2024).

Continued study of convergence guarantees, initialization sensitivity, robust training schemes, and application to non-Euclidean and structured domains (e.g., sets, graphs, physical systems) is ongoing, with foundational results highlighting DEQs as a central abstraction uniting depth efficiency, implicit modeling, and fixed-point scheme learning across machine learning (Bai et al., 2019, Sato et al., 3 Feb 2026, Mehta et al., 24 Nov 2025, Geng et al., 2023).
