Generalization Error Bounds for Neural ODEs

Updated 17 February 2026
  • The paper establishes rigorous quantitative generalization error bounds for Neural ODEs through statistical learning formulations and capacity control techniques.
  • It reveals that factors like overparameterization, time horizon, and parameter path smoothness critically affect convergence rates and model stability.
  • The work provides actionable guidelines for designing architectures and training strategies to ensure $C^1$-level generalization and accurate dynamical behavior.

Neural ordinary differential equations (Neural ODEs) define deep learning models in which transformations are parameterized by differential equations, allowing continuous-depth architectures with strong inductive bias for modeling dynamical systems and invertible flows. While the expressive power and applied success of Neural ODEs are widely recognized, rigorous quantitative generalization error bounds—describing the statistical risk and sample complexity for unseen data—have emerged only recently. This article systematically surveys the state of the art in quantitative generalization error bounds for Neural ODEs, focusing on the precise statistical, dynamical, and architectural factors governing generalization rates.

1. Statistical Learning Formulations for Neural ODEs

Quantitative generalization analysis for Neural ODEs is predicated on formalizing statistical learning setups in which the ODE-parameterized flows model the data generation process. One central example is parameterizing invertible transformations to represent complex densities: given a domain $D = [0,1]^d$ and a reference measure $\rho$ (Lipschitz, bounded below and above), the unknown data density $p_0$ generates samples $Z_1, \dots, Z_n \sim p_0$. Velocity fields $f \in \mathcal{F} \subset C^1(\Omega; \mathbb{R}^d)$, with suitable boundary constraints to ensure diffeomorphic flows, induce flow maps $X^f$ satisfying $dX^f/dt = f(X^f, t)$ and generate densities $p^f$ via the pushforward $p^f(x) = \rho(T^f(x)) \det \nabla T^f(x)$, with $T^f(x) = X^f(x, 1)$.

Risk is measured by the Hellinger distance $h(p^f, p_0) = \|\sqrt{p^f} - \sqrt{p_0}\|_{L^2(D)}$. The estimator $\hat{f}$ is obtained by likelihood maximization: $\hat{f} \in \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log p^f(Z_i)$ (Marzouk et al., 2023).
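As a numerical illustration of this risk metric, the Hellinger distance can be approximated on a grid. This is a minimal sketch with two hypothetical densities on $[0,1]$ standing in for $p^f$ and $p_0$ (none of the specifics below come from the cited work):

```python
import numpy as np

def hellinger_distance(p_f, p_0, dx):
    """Approximate h(p_f, p_0) = ||sqrt(p_f) - sqrt(p_0)||_{L^2(D)}
    for densities sampled on a uniform grid with cell size dx."""
    return np.sqrt(np.sum((np.sqrt(p_f) - np.sqrt(p_0)) ** 2) * dx)

# Example on D = [0, 1]: identical densities have distance 0,
# and h is bounded above by sqrt(2) for any pair of densities.
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
uniform = np.ones_like(x)   # uniform density on [0, 1]
tilted = 2.0 * x            # density p(x) = 2x
print(hellinger_distance(uniform, uniform, dx))  # → 0.0
print(hellinger_distance(uniform, tilted, dx))   # → ~0.338
```

The closed-form value for this pair is $h^2 = 2 - 4\sqrt{2}/3 \approx 0.114$, which the grid sum recovers up to discretization error.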

Alternative settings consider supervised learning with i.i.d. data $(x_i, y_i)$, predictions $y = h_\theta(x)$ read off from solutions of $\dot{z}(t) = f(z(t), t, \theta(t))$, and standard statistical losses. In ergodic or chaotic dynamics, generalization is characterized via the discrepancy between the learned and true invariant measures, necessitating metrics such as the Wasserstein distance or dynamical shadowing error (Park et al., 2024).
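In the supervised setting, the prediction map can be sketched by numerically integrating the parameterized vector field. Here is a minimal forward-Euler illustration; the tanh field `f` and the identity readout of the terminal state are hypothetical choices, not specified in the cited papers:

```python
import numpy as np

def f(z, t, theta):
    """Hypothetical time-dependent velocity field f(z, t, theta):
    a tanh nonlinearity applied to a linear map plus a time-scaled bias."""
    A, b = theta
    return np.tanh(A @ z + t * b)

def neural_ode_predict(x, theta, T=1.0, steps=100):
    """Forward Euler integration of dz/dt = f(z, t, theta) with z(0) = x;
    the prediction h_theta(x) here simply reads out the terminal state z(T)."""
    z, dt = x.astype(float), T / steps
    for i in range(steps):
        z = z + dt * f(z, i * dt, theta)
    return z

theta = (np.array([[0.0, -1.0], [1.0, 0.0]]), np.zeros(2))  # rotation-like field
y = neural_ode_predict(np.array([1.0, 0.0]), theta)
print(y)
```

Since $|\tanh| \le 1$ componentwise, the terminal state stays within $|x| + T$ of the origin, a crude instance of the boundedness assumptions used throughout the bounds below.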

2. General Nonparametric Statistical Convergence for Density Learning

Distribution learning via likelihood maximization over ODE models is governed by the trade-off between approximation (bias) and complexity (variance). The main general result, Theorem 2.5 in (Marzouk et al., 2023), establishes that for any estimator $\hat{f}$ maximizing the empirical likelihood over $\mathcal{F}$, and any oracle $f^* \in \mathcal{F}$,

$$\mathbb{E}[h^2(p^{\hat{f}}, p_0)] \leq C \left( h^2(p^{f^*}, p_0) + \delta_n^2 + \frac{1}{n} \right),$$

where $\delta_n$ is defined via the $C^1$-metric entropy integral $\Psi$ of $\mathcal{F}$ through $\sqrt{n}\,\delta_n^2 \gtrsim \Psi(\delta_n)$. The bias term is $h(p^{f^*}, p_0)$, and the variance is controlled by the $C^1$-covering number of $\mathcal{F}$ (Marzouk et al., 2023).
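To see how this calibration produces explicit rates, one can plug in a Hölder-type entropy integral $\Psi(\delta) \asymp \delta^{1-(d+1)/(2s)}$ for smoothness $s$ (an assumed scaling, chosen to be consistent with the specialized rates below); solving $\sqrt{n}\,\delta_n^2 = \Psi(\delta_n)$ then gives $\delta_n^2 = n^{-2s/(2s+d+1)}$. A quick numerical check of this algebra, with constants set to 1:

```python
import math

def calibrate(n, s, d):
    """Closed-form solution delta_n**2 = n**(-2s/(2s+d+1)) of the
    calibration sqrt(n) * delta**2 = Psi(delta) for the assumed
    Hoelder-type entropy integral Psi(delta) = delta**(1 - (d+1)/(2s))."""
    return n ** (-2 * s / (2 * s + d + 1))

n, s, d = 10_000, 2.5, 3
delta = math.sqrt(calibrate(n, s, d))

# Verify the fixed-point equation sqrt(n) * delta**2 = Psi(delta).
lhs = math.sqrt(n) * delta ** 2
rhs = delta ** (1 - (d + 1) / (2 * s))
print(abs(lhs - rhs) < 1e-9)  # → True
```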

Specialized Rates

  • For $C^k$-smooth velocity fields: if $p_0, \rho \in C^k$ with $k > d/2 + 3/2$, then

$$\mathbb{E}[h^2(p^{\hat{f}}, p_0)] \lesssim n^{-2(k-1-\gamma)/[2(k-1-\gamma)+d+1]}$$

for any small $\gamma > 0$.

  • For Neural ODE velocity fields (with $\operatorname{ReLU}^2$ activations for $C^1$ flow regularity):

$$\mathbb{E}[h^2(p^{\hat{f}}, p_0)] \lesssim n^{-2(k-1)/[d+1+2(k-1)]} \log n,$$

with architecture width and sparsity scaling as $n^{(d+1)/[2(k-1)+d+1]}$, constant depth, and weight norms scaling as $n^{1/[2(k-1)+d+1]}$.

A distinctive feature is the extra $+1$ in the effective dimension (arising from the time dependence of the velocity field), which slightly deteriorates the rate compared to classical finite-dimensional settings.
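The prescribed rate and architecture scalings can be evaluated directly. A small illustrative helper (absolute constants dropped) makes the dependence on smoothness $k$ and dimension $d$ explicit:

```python
import math

def neural_ode_rates(n, k, d):
    """Scalings from the C^1 Neural ODE bound above:
    risk ~ n^{-2(k-1)/[d+1+2(k-1)]} log n,
    width/sparsity ~ n^{(d+1)/[2(k-1)+d+1]},
    weight norm ~ n^{1/[2(k-1)+d+1]} (constants dropped)."""
    denom = 2 * (k - 1) + d + 1
    return {
        "risk": n ** (-2 * (k - 1) / denom) * math.log(n),
        "width": n ** ((d + 1) / denom),
        "weight_norm": n ** (1 / denom),
    }

# Smoother targets (larger k) give faster rates and smaller networks.
r_smooth = neural_ode_rates(n=10_000, k=6, d=3)
r_rough = neural_ode_rates(n=10_000, k=2, d=3)
print(r_smooth["risk"] < r_rough["risk"])    # → True
print(r_smooth["width"] < r_rough["width"])  # → True
```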

3. Capacity Control, Bounded Variation, and Overparameterization

An alternative, general analysis for both time-dependent and time-independent Neural ODEs, under Lipschitz vector fields, leverages the bounded-variation property of ODE solutions (Verma et al., 26 Aug 2025). The uniform generalization bound (holding with high probability) for an empirical-risk minimizer $\hat{h}$ is

$$R(\hat{h}) \leq R^n(\hat{h}) + O\left(L_\ell \sqrt{T V' d^{3/2}/n}\right),$$

where $L_\ell$ is the Lipschitz constant of the loss, $T$ the integration time, $V'$ a bound on the norm of the vector field $f$, and $d$ the output dimension (Verma et al., 26 Aug 2025). The constant $V'$ bundles parameter norms and Lipschitz constants. For time-independent ODEs, the bound does not depend on network width but grows linearly with depth.

This explicitly demonstrates that overparameterization—specifically, large depth or high parameter norms—inflates the capacity term, leading to looser generalization bounds. Empirical results confirm that networks with larger Lipschitz constants exhibit larger generalization gaps.
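The capacity term $O(L_\ell \sqrt{T V' d^{3/2}/n})$ can be tabulated to see how the time horizon inflates the bound. A minimal sketch with illustrative values and absolute constants dropped:

```python
import math

def capacity_term(n, T, V, d, L_loss=1.0):
    """Bounded-variation capacity term L_loss * sqrt(T * V * d**1.5 / n)
    from the uniform generalization bound (absolute constants dropped)."""
    return L_loss * math.sqrt(T * V * d ** 1.5 / n)

# Doubling the integration time T inflates the gap by sqrt(2),
# while quadrupling the sample size n halves it.
base = capacity_term(n=10_000, T=1.0, V=5.0, d=10)
print(capacity_term(n=10_000, T=2.0, V=5.0, d=10) / base)  # → ~1.414
print(capacity_term(n=40_000, T=1.0, V=5.0, d=10) / base)  # → ~0.5
```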

4. Generalization in Dynamical Systems and Ergodic Regimes

Traditional generalization metrics can fail to detect qualitative discrepancies in the long-term dynamics of learned ODE generators, especially in ergodic or chaotic systems (Park et al., 2024). In this vein, generalization is defined in terms of $C^1$-strong approximation: small sup-norm errors in both $G(x) - F(x)$ and $dG(x) - dF(x)$ over all $x$ in the phase space guarantee orbit shadowing (hyperbolic shadowing), and thus convergence of the empirical invariant measure of the learned system to the true physical measure $\mu$.

Formally, if $P$ denotes the total number of parameters (e.g., $P = O(LW^2)$ for fully connected nets of depth $L$ and width $W$), then $C^1$-uniform generalization and statistical shadowing accuracy are achieved whenever $n \gtrsim P \log P / \epsilon^2$. The resulting generalization error in Wasserstein distance satisfies

$$W^1(\hat{\mu}_N, \mu) \leq O(\sqrt{P/n})$$

for the empirical invariant measure $\hat{\mu}_N$ of the learned flow. Notably, MSE-only loss functions fail to control such generalization; only $C^1$ (Jacobian-matching) regularization yields bounds guaranteeing correct physical statistics in ergodic regimes (Park et al., 2024).
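The sample-size threshold and the resulting Wasserstein rate can be evaluated together; a small illustrative helper with absolute constants set to 1:

```python
import math

def min_samples(P, eps):
    """Sample-size threshold n ~ P log P / eps**2 for C^1-uniform
    generalization (absolute constant dropped)."""
    return P * math.log(P) / eps ** 2

def w1_bound(P, n):
    """Wasserstein-1 generalization bound O(sqrt(P / n)) for the
    empirical invariant measure (absolute constant dropped)."""
    return math.sqrt(P / n)

L, W = 4, 256        # depth and width of a fully connected net
P = L * W ** 2       # P = O(L W^2) parameters
n = min_samples(P, eps=0.1)
print(int(n))        # samples needed for eps = 0.1 accuracy
print(w1_bound(P, n))
```

At the threshold $n = P \log P / \epsilon^2$, the bound collapses to $\epsilon / \sqrt{\log P}$, i.e., slightly better than the target accuracy.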

5. Time Horizon and Overparameterization: Depth, Stability, and Structural Constraints

Generalization error bounds in Neural ODEs fundamentally depend on architectural factors:

  • Time horizon (depth): In the large-time regime, the final time $T$ of the ODE (analogous to network depth) governs both training-error decay and hypothesis-class complexity. Under $L^2$-regularization, training error decays as $O(1/T)$ and population risk as $O(1/\sqrt{n})$ for $T \sim n^{1/2}$ (Esteve et al., 2020). With additional trajectory-tracking regularization, training error decays exponentially, i.e., as $O(e^{-\mu T})$, again yielding population risk $O(n^{-1/2})$ for any $T \geq T^*$.
  • Parameter path smoothness: In continuous-time parameterized ODEs, generalization rates interpolate between $n^{-1/2}$ (for time-independent parameters) and $n^{-1/4}$ (for Lipschitz-varying parameter paths), reflecting the infinite-dimensional complexity of the path class (Marion, 2023).
  • PAC bounds and stability: By embedding Neural ODEs as continuous-time linear parameter-varying (LPV) systems, generalization error bounds scaling as $O(1/\sqrt{N})$ can be achieved, with constants independent of the integration horizon under weighted $H_2$-stability, further controlled by parameter- and input-norm constraints (Rácz et al., 2023).
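The large-time trade-off in the first bullet can be made concrete: choosing $T \sim n^{1/2}$ balances the $O(1/T)$ training-error decay against the $O(1/\sqrt{n})$ statistical rate, so neither term dominates. A schematic calculation with constants set to 1:

```python
import math

def risk_profile(n):
    """Schematic large-time regime under L^2-regularization:
    picking T = sqrt(n) makes the O(1/T) optimization term equal
    the O(1/sqrt(n)) statistical term."""
    T = math.sqrt(n)
    return {"T": T, "train_term": 1.0 / T, "stat_term": 1.0 / math.sqrt(n)}

r = risk_profile(n=10_000)
print(r["train_term"] == r["stat_term"])  # → True: the two terms balance
```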

6. Infinite-Horizon and Dynamical Systems Guarantees

Recent developments have extended quantitative generalization guarantees to the infinite-horizon regime and to structurally stable dynamical system classes (Morse–Smale systems with multistability and limit cycles). For flows $\varphi, \hat{\varphi}$, if $(\varepsilon, \delta)$-closeness holds, i.e., the learned Neural ODE matches the reference flow up to error $\varepsilon$ for all but a $\delta$ fraction of initial conditions, then for any $p \geq 1$ the temporal $L^p$ generalization error satisfies

$$\mathcal{E}_{p,\infty}(\varphi, \hat{\varphi}) \leq \varepsilon^p + \delta D^p,$$

where $D$ is the diameter of the phase space (Sagodi et al., 9 Feb 2026). This connects pointwise/topological guarantees to loss-based $L^p$ errors over arbitrarily long time horizons, a property unique to Neural ODEs among neural sequence models.
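The bound combines a worst-case term on the well-approximated initial conditions with the phase-space diameter on the remaining $\delta$ fraction, and can be evaluated directly (illustrative constants):

```python
def lp_generalization_bound(eps, delta, D, p=2):
    """Temporal L^p bound eps**p + delta * D**p from (eps, delta)-closeness:
    error at most eps on a 1 - delta fraction of initial conditions, and at
    worst the phase-space diameter D on the remaining delta fraction."""
    return eps ** p + delta * D ** p

# Shrinking the bad set (delta) matters more when the diameter D is large.
print(lp_generalization_bound(eps=0.01, delta=1e-4, D=10.0, p=2))  # → ~0.0101
```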

7. Implications for Architecture Design and Training

The quantitative bounds for Neural ODEs prescribe explicit design and training recommendations:

  • For target density regularity $k$ and sample size $n$, architecture width/sparsity should scale as $n^{(d+1)/[2(k-1)+d+1]}$; depth should be kept constant, and weight norms scaled as $n^{1/[2(k-1)+d+1]}$.
  • Overparameterization (large depth or uncontrolled parameter growth) degrades generalization. The extra degree in dimension (from time) can be overcome by regularizing trajectories or imposing structural constraints.
  • Achieving physically meaningful long-term dynamics or ergodic measures in learned models fundamentally requires $C^1$-level generalization, not merely small prediction error. Penalizing Jacobian mismatch during training is essential.
  • Stability (via Lyapunov, spectral norm, or margin constraints) tightens generalization certificates and ensures independence of the integration horizon for certain classes (LPV–Neural ODE embeddings).
  • Penalizing rapidly varying parameter paths or enforcing layerwise smoothness in ResNet analogues recovers depth-independent generalization rates.
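The Jacobian-matching recommendation can be sketched as a training objective: alongside the usual MSE on the vector field values, penalize the mismatch of finite-difference Jacobians. A minimal numpy sketch, in which the fields `F`, `G`, the sample points `xs`, and the weight `lam` are all hypothetical:

```python
import numpy as np

def fd_jacobian(f, x, h=1e-5):
    """Central finite-difference Jacobian of f at x."""
    d = x.size
    J = np.zeros((f(x).size, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

def c1_loss(G, F, xs, lam=1.0):
    """MSE on field values plus a Jacobian-mismatch penalty: a C^1-level
    training objective, as opposed to an MSE-only loss."""
    value = np.mean([np.sum((G(x) - F(x)) ** 2) for x in xs])
    jac = np.mean([np.sum((fd_jacobian(G, x) - fd_jacobian(F, x)) ** 2) for x in xs])
    return value + lam * jac

F = lambda x: np.array([x[1], -np.sin(x[0])])  # true field (pendulum-like)
G = lambda x: np.array([x[1], -x[0]])          # learned field (its linearization)
xs = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
print(c1_loss(G, F, xs))  # > 0: Jacobian mismatch is detected away from the origin
```

Note that at the origin the linearized field matches `F` to first order, so the $C^1$ penalty is driven entirely by the sample point away from the fixed point.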

Open questions remain regarding rates in unbounded domains, quantitative guarantees for general training algorithms, and extension to stochastic/diffusive continuous-flow architectures.


References:

  • Marzouk et al., 2023
  • Verma et al., 26 Aug 2025
  • Park et al., 2024
  • Esteve et al., 2020
  • Marion, 2023
  • Rácz et al., 2023
  • Sagodi et al., 9 Feb 2026
  • Jabir et al., 2019
