
C-DEQ: Consistency Deep Equilibrium Model

Updated 10 February 2026
  • C-DEQ is an implicit neural framework that reframes deep equilibrium inference as an ODE trajectory to achieve rapid fixed-point convergence.
  • It employs global and local consistency distillation, training a student model to predict equilibrium points in as few as one inference step.
  • The method delivers up to 20× faster inference with constant memory usage, significantly outperforming traditional deep equilibrium approaches.

A Consistency Deep Equilibrium Model (C-DEQ) is an implicit neural network framework that combines fixed-point modeling and consistency-driven distillation to produce efficient, low-latency inference while retaining the expressive power and constant memory advantages of standard Deep Equilibrium Models (DEQs). C-DEQ reframes DEQ inference as evolution along a canonical ODE trajectory and distills that process so a student model can perform "few-step" or even one-step prediction to the fixed point. This approach delivers substantial improvements in speed, computational efficiency, and accuracy over traditional DEQs, especially in settings with constrained inference budgets (Lin et al., 3 Feb 2026).

1. Foundations of Deep Equilibrium Models

A Deep Equilibrium Model describes hidden representations as fixed points of a nonlinear layer, defined implicitly by

z^* = f_\theta(z^*, x),

where f_\theta is a neural network parameterized by \theta, and x is the input. Solving for z^* requires iterative root-finding procedures such as Picard iteration, quasi-Newton methods (e.g., Broyden's method), or Anderson Acceleration (AA). The main appeal of DEQs lies in their constant memory footprint with respect to depth, achieved by circumventing the storage of intermediate activations via the Implicit Function Theorem. However, reaching equilibrium typically requires many forward iterations (often tens per example), which incurs significant inference latency.
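The fixed-point structure above can be sketched with a toy contractive layer. The network f_theta, its weights, and the solver settings below are illustrative stand-ins, not the paper's architecture; Picard iteration is shown for simplicity in place of Broyden's method or AA.

```python
import numpy as np

# Toy DEQ layer f_theta(z, x) = tanh(W z + U x). The small weight scale
# makes f_theta a contraction, so fixed-point iteration converges.
rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)
U = rng.standard_normal((8, 4))
x = rng.standard_normal(4)

def f_theta(z, x):
    return np.tanh(W @ z + U @ x)

# Picard (fixed-point) iteration: z_{k+1} = f_theta(z_k, x).
z = np.zeros(8)
for k in range(200):
    z_next = f_theta(z, x)
    if np.linalg.norm(z_next - z) < 1e-10:
        z = z_next
        break
    z = z_next

# At convergence, z approximates the equilibrium z* = f_theta(z*, x).
residual = np.linalg.norm(z - f_theta(z, x))
```

The residual measures how close the iterate is to satisfying the implicit equation; a practical DEQ would use an accelerated solver to reduce the iteration count this loop illustrates.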

2. ODE Trajectory Perspective and Teacher Trajectories

Although the fixed point z^* is path-independent, each input can be associated with a unique trajectory by interpreting fixed-point iteration as the discretization of an ODE:

\frac{dz(t)}{dt} = f_\theta(z(t), x) - z(t), \quad z(0) = 0.

This "fixed-point ODE" (FP-ODE) has a limiting state z^* satisfying the DEQ equation. Using Anderson Acceleration with a fixed initial condition, the forward pass generates a "teacher" trajectory \mathcal{T} = \{z_0, \ldots, z_K\} with z_K \approx z^* for any given input. This trajectory provides the scaffold along which consistency distillation is performed.
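Generating such a teacher trajectory can be sketched by Euler-discretizing the FP-ODE from z(0) = 0. The paper uses Anderson Acceleration for this forward pass; plain damped iteration is shown here for brevity, and f_theta, the step size h, and the trajectory length K are illustrative.

```python
import numpy as np

# Euler discretization of the FP-ODE dz/dt = f_theta(z, x) - z,
# caching the teacher trajectory T = {z_0, ..., z_K}.
rng = np.random.default_rng(1)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)
U = rng.standard_normal((8, 4))

def f_theta(z, x):
    return np.tanh(W @ z + U @ x)

def teacher_trajectory(x, K=30, h=0.9):
    z = np.zeros(8)                       # fixed initial condition z(0) = 0
    traj = [z.copy()]
    for _ in range(K):
        z = z + h * (f_theta(z, x) - z)   # Euler step of the FP-ODE
        traj.append(z.copy())
    return traj                           # z_K approximates the equilibrium z*

x = rng.standard_normal(4)
traj = teacher_trajectory(x)
```

Note that with h = 1 the Euler step reduces exactly to Picard iteration, which is why fixed-point iteration can be read as an ODE discretization in the first place.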

3. Consistency Distillation Objectives

In the C-DEQ framework, a parameterized student model g_\phi is trained to directly map any intermediate state along the ODE trajectory (together with a time embedding) to the equilibrium point. The training objective combines three terms:

  • Global Consistency: Forcing g_\phi(z_k, t_k) to match the final state z_K, via

\mathcal{L}_{\rm global} = \mathbb{E}_k \left[ d(g_\phi(z_k, t_k), z_K) \right],

where d is typically mean-squared error and t_k is a continuous time embedding of iteration k.

  • Local Consistency: Ensuring that applying g_\phi at consecutive ODE steps yields stable, consistent outputs,

\mathcal{L}_{\rm local} = \mathbb{E}_k \left[ d(g_\phi(z_k, t_k), g_{\phi^-}(z_{k-1}, t_{k-1})) \right],

with \phi^- a moving average of the parameters for stabilization.

  • Task Loss: An auxiliary term (cross-entropy or regression) to preserve accuracy on downstream outputs.

The combined distillation loss is \mathcal{L}_{\rm distill} = \lambda_1 \mathcal{L}_{\rm global} + (1 - \lambda_1) \mathcal{L}_{\rm local} + \lambda_2 \mathcal{L}_{\rm task}.
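The three terms can be assembled as follows on a cached teacher trajectory. The student g_phi, the EMA target g_phi_minus, the task loss, the weights lambda1 and lambda2, and the time embedding t_k = k/K are all illustrative stand-ins, not the paper's exact choices.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def distill_loss(traj, g_phi, g_phi_minus, task_loss, lam1=0.5, lam2=0.1):
    """Combined C-DEQ distillation loss, averaged over trajectory steps."""
    K = len(traj) - 1
    z_K = traj[-1]
    L_global, L_local = 0.0, 0.0
    for k in range(1, K + 1):
        t_k, t_prev = k / K, (k - 1) / K
        # Global consistency: map any state to the final state z_K.
        L_global += mse(g_phi(traj[k], t_k), z_K)
        # Local consistency: agree with the EMA target one step earlier.
        L_local += mse(g_phi(traj[k], t_k), g_phi_minus(traj[k - 1], t_prev))
    L_global /= K
    L_local /= K
    return lam1 * L_global + (1 - lam1) * L_local + lam2 * task_loss

# Toy usage: an identity student on a dummy linear trajectory.
traj = [np.full(3, k / 5.0) for k in range(6)]
g = lambda z, t: z
loss = distill_loss(traj, g, g, task_loss=0.0)
```

In training, the gradient of this loss with respect to the student parameters would be taken while the EMA target is held fixed, as stated in the local consistency term.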

4. C-DEQ Architecture and Training Protocol

The student map g_\phi(z_{\le t}, t) is constructed as a boundary-mixed combination of the current state and a learned correction: g_\phi(z_t, t) = c_{\rm skip}(t) z_t + c_{\rm out}(t) P_\phi(z_{\le t}, t), where c_{\rm skip} and c_{\rm out} provide a schedule- and time-dependent mixing between the identity and the correction, and P_\phi employs an AA-style one-step Anderson update with a learned residual network h_\phi. Training proceeds by first caching the AA-based teacher trajectory, then randomly sampling ODE steps and updating \phi by stochastic gradient descent on the total distillation loss, while maintaining an EMA "target" network for local stability.
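The boundary-mixing structure can be sketched as below. The specific schedules c_skip and c_out and the correction P_phi are illustrative assumptions chosen only to satisfy the boundary condition g_phi(z, 0) = z (identity at t = 0); they are not the paper's parameterization.

```python
import numpy as np

# Illustrative time-dependent mixing schedules: identity at t = 0,
# increasingly weighted toward the learned correction as t grows.
def c_skip(t):
    return 1.0 / (1.0 + t)      # equals 1 at t = 0

def c_out(t):
    return t / (1.0 + t)        # equals 0 at t = 0

def P_phi(z, t):
    # Stand-in for the learned AA-style correction network.
    return np.tanh(z) + 0.1 * t

def g_phi(z, t):
    """Boundary-mixed student: c_skip(t) * z + c_out(t) * P_phi(z, t)."""
    return c_skip(t) * z + c_out(t) * P_phi(z, t)

z0 = np.array([0.5, -1.0, 2.0])
# Boundary condition: at t = 0 the student is the identity map.
assert np.allclose(g_phi(z0, 0.0), z0)
```

This mixing mirrors the parameterization used in consistency models, where the skip/output schedules enforce that the map is exact at the trajectory's start.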

5. Inference Algorithms and Computational Tradeoffs

C-DEQ enables flexible inference modes:

  • One-step Inference: Input z_0 at t_0 and directly predict z_T = g_\phi(z_0, t_0).
  • Few-step Chaining: Select J intermediate time points up to T. For each, recursively apply g_\phi using the AA-style update and boundary mixing, producing a final prediction after J steps.

This framework allows explicit trade-off between computational budget (via number of function evaluations, NFEs) and solution quality. Empirical results demonstrate that C-DEQ achieves high accuracy in as few as 1–8 steps.
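The two inference modes can be sketched as follows. The student g_phi, its schedules, and the uniform time grid are illustrative assumptions; only the control flow (one evaluation versus J chained evaluations) reflects the modes described above.

```python
import numpy as np

# Illustrative student map (same stand-in parameterization as before).
def c_skip(t): return 1.0 / (1.0 + t)
def c_out(t):  return t / (1.0 + t)
def P_phi(z, t): return np.tanh(z) + 0.1 * t

def g_phi(z, t):
    return c_skip(t) * z + c_out(t) * P_phi(z, t)

def one_step(z0, T=1.0):
    # NFE = 1: jump directly from the initial state to the prediction at T.
    return g_phi(z0, T)

def few_step(z0, J=4, T=1.0):
    # NFE = J: recursively apply g_phi along J time points up to T.
    z = z0
    for t in np.linspace(T / J, T, J):
        z = g_phi(z, t)
    return z

z0 = np.zeros(3)
zT_1 = one_step(z0)        # cheapest: one function evaluation
zT_J = few_step(z0, J=4)   # higher budget, typically closer to z*
```

Varying J is exactly the NFE-versus-quality dial the section describes: J = 1 gives the lowest latency, while larger J trades compute for a better equilibrium estimate.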

6. Empirical Evaluation Across Modalities

Extensive benchmarks confirm the efficacy of C-DEQ in various domains under few-step inference constraints:

Task / Model          NFE=1              NFE=2              NFE=8
WikiText-103 (perplexity / latency)
  DEQ                 255.9 / 0.09s      223.4 / 0.17s      104.3 / 0.65s
  HyperDEQ            70.2 / 0.73s       51.3 / 0.80s       31.4 / 1.21s
  C-DEQ               47.9 / 0.05s       39.0 / 0.09s       26.4 / 0.37s
ImageNet (accuracy / latency)
  DEQ                 1.17% / 0.48s      8.12% / 0.67s      64.13% / 0.85s
  C-DEQ               47.1% / 0.52s      58.3% / 0.69s      74.0% / 0.87s
ogbn-arxiv (accuracy / latency)
  IGNN                8.6% / 0.03s       13.8% / 0.05s      45.9% / 0.18s
  C-DEQ               56.8% / 0.05s      67.5% / 0.08s      71.4% / 0.16s

Across domains, C-DEQ yields 2–20× accuracy gains over baseline DEQ variants at identical function evaluation budgets, and matches or exceeds explicit baselines at substantially lower memory costs (Lin et al., 3 Feb 2026).

7. Comparison with Conventional DEQs and Implications

  • Inference Speed: C-DEQ reduces the equilibrium-approach NFE from tens (standard DEQ) to the 1–8 range, yielding up to 20× faster convergence at target error thresholds.
  • Memory and Complexity: Maintains O(1) memory for both training and inference. Per-step computation matches a standard DEQ layer evaluation, plus a minor AA-style mixing overhead.
  • Practicality: Memory overhead compared to vanilla DEQ is negligible (~0.1 GB), while explicit sequence models (e.g., Transformer-XL) require order-of-magnitude more memory (>7 GB).
  • Modularity: The C-DEQ approach retains DEQ’s hardware-agnostic, depth-constant advantages and can be adapted to diverse architectures and domains without architectural changes to the underlying solver.

A plausible implication is that C-DEQ enables deployment of powerful fixed-point implicit models in latency- or resource-constrained environments previously inaccessible to classical DEQs (Lin et al., 3 Feb 2026).


In summary, the Consistency Deep Equilibrium Model framework distills a canonical ODE-based inference trajectory into a low-latency, dynamically composable mapping. By leveraging global and local consistency losses and AA-informed neural architectures, C-DEQ achieves DEQ-level equilibrium accuracy in very few steps, attaining up to 20× efficiency improvements over prior implicit strategies, while preserving the O(1) memory footprint and the computation-vs.-accuracy tradeoff flexibility inherent in the DEQ paradigm (Lin et al., 3 Feb 2026).
