
C-DEQ: Consistency Deep Equilibrium Model

Updated 10 February 2026
  • C-DEQ is an implicit neural framework that reframes deep equilibrium inference as an ODE trajectory to achieve rapid fixed-point convergence.
  • It employs global and local consistency distillation, training a student model to predict equilibrium points in as few as one inference step.
  • The method delivers up to 20× faster inference with constant memory usage, significantly outperforming traditional deep equilibrium approaches.

A Consistency Deep Equilibrium Model (C-DEQ) is an implicit neural network framework that combines fixed-point modeling and consistency-driven distillation to produce efficient, low-latency inference while retaining the expressive power and constant memory advantages of standard Deep Equilibrium Models (DEQs). C-DEQ reframes DEQ inference as evolution along a canonical ODE trajectory and distills that process so a student model can perform "few-step" or even one-step prediction to the fixed point. This approach delivers substantial improvements in speed, computational efficiency, and accuracy over traditional DEQs, especially in settings with constrained inference budgets (Lin et al., 3 Feb 2026).

1. Foundations of Deep Equilibrium Models

A Deep Equilibrium Model describes hidden representations as fixed points of a nonlinear layer, defined implicitly by

z^* = f_\theta(z^*, x),

where f_\theta is a neural network parameterized by \theta, and x is the input. Solving for z^* requires iterative root-finding procedures such as Picard iteration, quasi-Newton methods (e.g., Broyden's method), or Anderson Acceleration (AA). The main appeal of DEQs lies in their constant memory footprint with respect to depth, achieved by circumventing the storage of intermediate activations via the Implicit Function Theorem. However, reaching equilibrium typically requires many forward iterations (often tens per example), which incurs significant inference latency.
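The fixed-point structure above can be sketched with a toy contractive layer. The network f_theta, its weights, and the solver settings below are illustrative stand-ins, not the paper's architecture; Picard iteration is shown for simplicity in place of Broyden's method or AA.

```python
import numpy as np

# Toy DEQ layer f_theta(z, x) = tanh(W z + U x). The small weight scale
# makes f_theta a contraction, so fixed-point iteration converges.
rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)
U = rng.standard_normal((8, 4))
x = rng.standard_normal(4)

def f_theta(z, x):
    return np.tanh(W @ z + U @ x)

# Picard (fixed-point) iteration: z_{k+1} = f_theta(z_k, x).
z = np.zeros(8)
for k in range(200):
    z_next = f_theta(z, x)
    if np.linalg.norm(z_next - z) < 1e-10:
        z = z_next
        break
    z = z_next

# At convergence, z approximates the equilibrium z* = f_theta(z*, x).
residual = np.linalg.norm(z - f_theta(z, x))
```

The residual measures how close the iterate is to satisfying the implicit equation; a practical DEQ would use an accelerated solver to reduce the iteration count this loop illustrates.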

2. ODE Trajectory Perspective and Teacher Trajectories

Although the fixed point z^* is path-independent, each input can be associated with a unique trajectory by interpreting fixed-point iteration as the discretization of an ODE:

\frac{dz(t)}{dt} = f_\theta(z(t), x) - z(t), \quad z(0) = 0.

This "fixed-point ODE" (FP-ODE) has a limiting state z^* satisfying the DEQ equation. Using Anderson Acceleration with a fixed initial condition, the forward pass generates a "teacher" trajectory \mathcal{T} = \{z_0, \ldots, z_K\} with z_K \approx z^* for any given input. This trajectory provides the scaffold along which consistency distillation is performed.
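Generating such a teacher trajectory can be sketched by Euler-discretizing the FP-ODE from z(0) = 0. The paper uses Anderson Acceleration for this forward pass; plain damped iteration is shown here for brevity, and f_theta, the step size h, and the trajectory length K are illustrative.

```python
import numpy as np

# Euler discretization of the FP-ODE dz/dt = f_theta(z, x) - z,
# caching the teacher trajectory T = {z_0, ..., z_K}.
rng = np.random.default_rng(1)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)
U = rng.standard_normal((8, 4))

def f_theta(z, x):
    return np.tanh(W @ z + U @ x)

def teacher_trajectory(x, K=30, h=0.9):
    z = np.zeros(8)                       # fixed initial condition z(0) = 0
    traj = [z.copy()]
    for _ in range(K):
        z = z + h * (f_theta(z, x) - z)   # Euler step of the FP-ODE
        traj.append(z.copy())
    return traj                           # z_K approximates the equilibrium z*

x = rng.standard_normal(4)
traj = teacher_trajectory(x)
```

Note that with h = 1 the Euler step reduces exactly to Picard iteration, which is why fixed-point iteration can be read as an ODE discretization in the first place.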

3. Consistency Distillation Objectives

In the C-DEQ framework, a parameterized student model g_\phi is trained to directly map any intermediate state along the ODE trajectory (together with a time embedding) to the equilibrium point. The training objective combines three terms:

  • Global Consistency: Forcing g_\phi(z_k, t_k) to match the final state z_K, via

\mathcal{L}_{\rm global} = \mathbb{E}_k \left[ d(g_\phi(z_k, t_k), z_K) \right],

where d is typically mean-squared error and t_k is a continuous time embedding of iteration k.

  • Local Consistency: Ensuring that applying g_\phi at consecutive ODE steps yields stable, consistent outputs,

\mathcal{L}_{\rm local} = \mathbb{E}_k \left[ d(g_\phi(z_k, t_k), g_{\phi^-}(z_{k-1}, t_{k-1})) \right],

with \phi^- a moving average of the parameters for stabilization.

  • Task Loss: An auxiliary term (cross-entropy or regression) to preserve accuracy on downstream outputs.

The combined distillation loss is \mathcal{L}_{\rm distill} = \lambda_1 \mathcal{L}_{\rm global} + (1 - \lambda_1) \mathcal{L}_{\rm local} + \lambda_2 \mathcal{L}_{\rm task}.
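The three terms can be assembled as follows on a cached teacher trajectory. The student g_phi, the EMA target g_phi_minus, the task loss, the weights lambda1 and lambda2, and the time embedding t_k = k/K are all illustrative stand-ins, not the paper's exact choices.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def distill_loss(traj, g_phi, g_phi_minus, task_loss, lam1=0.5, lam2=0.1):
    """Combined C-DEQ distillation loss, averaged over trajectory steps."""
    K = len(traj) - 1
    z_K = traj[-1]
    L_global, L_local = 0.0, 0.0
    for k in range(1, K + 1):
        t_k, t_prev = k / K, (k - 1) / K
        # Global consistency: map any state to the final state z_K.
        L_global += mse(g_phi(traj[k], t_k), z_K)
        # Local consistency: agree with the EMA target one step earlier.
        L_local += mse(g_phi(traj[k], t_k), g_phi_minus(traj[k - 1], t_prev))
    L_global /= K
    L_local /= K
    return lam1 * L_global + (1 - lam1) * L_local + lam2 * task_loss

# Toy usage: an identity student on a dummy linear trajectory.
traj = [np.full(3, k / 5.0) for k in range(6)]
g = lambda z, t: z
loss = distill_loss(traj, g, g, task_loss=0.0)
```

In training, the gradient of this loss with respect to the student parameters would be taken while the EMA target is held fixed, as stated in the local consistency term.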

4. C-DEQ Architecture and Training Protocol

The student map g_\phi(z_{\le t}, t) is constructed as a boundary-mixed combination of the current state and a learned correction: g_\phi(z_t, t) = c_{\rm skip}(t) z_t + c_{\rm out}(t) P_\phi(z_{\le t}, t), where c_{\rm skip} and c_{\rm out} provide a schedule- and time-dependent mixing between the identity and the correction, and P_\phi employs an AA-style one-step Anderson update with a learned residual network h_\phi. Training proceeds by first caching the AA-based teacher trajectory, then randomly sampling ODE steps and updating \phi by stochastic gradient descent on the total distillation loss, while maintaining an EMA "target" network for local stability.
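The boundary-mixing structure can be sketched as below. The specific schedules c_skip and c_out and the correction P_phi are illustrative assumptions chosen only to satisfy the boundary condition g_phi(z, 0) = z (identity at t = 0); they are not the paper's parameterization.

```python
import numpy as np

# Illustrative time-dependent mixing schedules: identity at t = 0,
# increasingly weighted toward the learned correction as t grows.
def c_skip(t):
    return 1.0 / (1.0 + t)      # equals 1 at t = 0

def c_out(t):
    return t / (1.0 + t)        # equals 0 at t = 0

def P_phi(z, t):
    # Stand-in for the learned AA-style correction network.
    return np.tanh(z) + 0.1 * t

def g_phi(z, t):
    """Boundary-mixed student: c_skip(t) * z + c_out(t) * P_phi(z, t)."""
    return c_skip(t) * z + c_out(t) * P_phi(z, t)

z0 = np.array([0.5, -1.0, 2.0])
# Boundary condition: at t = 0 the student is the identity map.
assert np.allclose(g_phi(z0, 0.0), z0)
```

This mixing mirrors the parameterization used in consistency models, where the skip/output schedules enforce that the map is exact at the trajectory's start.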

5. Inference Algorithms and Computational Tradeoffs

C-DEQ enables flexible inference modes:

  • One-step Inference: Input z_0 at t_0 and directly predict z_T = g_\phi(z_0, t_0).
  • Few-step Chaining: Select J intermediate time points up to T. For each, recursively apply g_\phi using the AA-style update and boundary mixing, producing a final prediction after J steps.

This framework allows explicit trade-off between computational budget (via number of function evaluations, NFEs) and solution quality. Empirical results demonstrate that C-DEQ achieves high accuracy in as few as 1–8 steps.
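The two inference modes can be sketched as follows. The student g_phi, its schedules, and the uniform time grid are illustrative assumptions; only the control flow (one evaluation versus J chained evaluations) reflects the modes described above.

```python
import numpy as np

# Illustrative student map (same stand-in parameterization as before).
def c_skip(t): return 1.0 / (1.0 + t)
def c_out(t):  return t / (1.0 + t)
def P_phi(z, t): return np.tanh(z) + 0.1 * t

def g_phi(z, t):
    return c_skip(t) * z + c_out(t) * P_phi(z, t)

def one_step(z0, T=1.0):
    # NFE = 1: jump directly from the initial state to the prediction at T.
    return g_phi(z0, T)

def few_step(z0, J=4, T=1.0):
    # NFE = J: recursively apply g_phi along J time points up to T.
    z = z0
    for t in np.linspace(T / J, T, J):
        z = g_phi(z, t)
    return z

z0 = np.zeros(3)
zT_1 = one_step(z0)        # cheapest: one function evaluation
zT_J = few_step(z0, J=4)   # higher budget, typically closer to z*
```

Varying J is exactly the NFE-versus-quality dial the section describes: J = 1 gives the lowest latency, while larger J trades compute for a better equilibrium estimate.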

6. Empirical Evaluation Across Modalities

Extensive benchmarks confirm the efficacy of C-DEQ in various domains under few-step inference constraints:

Task / Model          NFE=1              NFE=2              NFE=8
WikiText-103 (perplexity / latency)
  DEQ                 255.9 / 0.09s      223.4 / 0.17s      104.3 / 0.65s
  HyperDEQ            70.2 / 0.73s       51.3 / 0.80s       31.4 / 1.21s
  C-DEQ               47.9 / 0.05s       39.0 / 0.09s       26.4 / 0.37s
ImageNet (accuracy / latency)
  DEQ                 1.17% / 0.48s      8.12% / 0.67s      64.13% / 0.85s
  C-DEQ               47.1% / 0.52s      58.3% / 0.69s      74.0% / 0.87s
ogbn-arxiv (accuracy / latency)
  IGNN                8.6% / 0.03s       13.8% / 0.05s      45.9% / 0.18s
  C-DEQ               56.8% / 0.05s      67.5% / 0.08s      71.4% / 0.16s

Across domains, C-DEQ yields 2–20× accuracy gains over baseline DEQ variants at identical function evaluation budgets, and matches or exceeds explicit baselines at substantially lower memory costs (Lin et al., 3 Feb 2026).

7. Comparison with Conventional DEQs and Implications

  • Inference Speed: C-DEQ reduces the equilibrium-approach NFE from tens (standard DEQ) to the 1–8 range, yielding up to 20× faster convergence at target error thresholds.
  • Memory and Complexity: Maintains O(1) memory for both training and inference. Per-step computation matches a standard DEQ layer evaluation, plus a minor AA-style mixing overhead.
  • Practicality: Memory overhead compared to vanilla DEQ is negligible (~0.1 GB), while explicit sequence models (e.g., Transformer-XL) require order-of-magnitude more memory (>7 GB).
  • Modularity: The C-DEQ approach retains DEQ’s hardware-agnostic, depth-constant advantages and can be adapted to diverse architectures and domains without architectural changes to the underlying solver.

A plausible implication is that C-DEQ enables deployment of powerful fixed-point implicit models in latency- or resource-constrained environments previously inaccessible to classical DEQs (Lin et al., 3 Feb 2026).


In summary, the Consistency Deep Equilibrium Model framework distills a canonical ODE-based inference trajectory into a low-latency, dynamically composable mapping. By leveraging global and local consistency losses and AA-informed neural architectures, C-DEQ achieves DEQ-level equilibrium accuracy in very few steps, attaining up to 20× efficiency improvements over prior implicit strategies, while preserving the O(1) memory footprint and the computation-vs.-accuracy tradeoff flexibility inherent in the DEQ paradigm (Lin et al., 3 Feb 2026).
