
Second-Order Sobolev Training

Updated 22 January 2026
  • The paper’s main contribution is the formulation of a Sobolev loss that penalizes discrepancies in function values, gradients, and Hessians for higher-fidelity approximation.
  • It employs stochastic Hessian approximations to manage computational costs, ensuring the method remains scalable for large models.
  • Empirical evaluations reveal that, despite its aim of enhanced performance, second-order training faces hyperparameter-tuning challenges and offers only limited gains over first-order methods.

Second-order Sobolev training is a supervised learning framework that extends conventional neural network fitting to include not only the values of a target function but also its first and second derivatives with respect to the input. This methodology is motivated by the observation that in certain settings, such as scientific computing, ground-truth derivative information is either readily accessible or can be efficiently computed, and exploiting this information may lead to improved data efficiency, generalization, and physical consistency. The objective is to directly penalize mismatches not only in function values but also in the associated gradients (Jacobian) and curvatures (Hessian), targeting higher-fidelity approximation in the corresponding Sobolev space norm (Czarnecki et al., 2017, Schütz et al., 15 Jan 2026).

1. Theoretical Foundation: Sobolev Spaces and Loss Formulation

Let $\Omega \subset \mathbb{R}^n$ be compact. For integer $k \geq 0$, the Sobolev space $W^{k,2}(\Omega)$ (denoted $S^k$) consists of functions $f \colon \Omega \rightarrow \mathbb{R}$ (or $\mathbb{R}^m$) whose weak derivatives up to order $k$ are square-integrable. The corresponding Sobolev norm is

$$\|f\|_{S^k}^2 = \int_\Omega |f(x)|^2\,dx + \sum_{j=1}^k \int_\Omega \|D^j f(x)\|_F^2\,dx$$

where $D^j f$ collects all $j$-th order partial derivatives and $\|\cdot\|_F$ is the Frobenius norm.
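As a concrete illustration of the $S^2$ norm, the three integrals can be approximated by simple quadrature for a function whose derivatives are known in closed form. The choice of $\sin(x)$ on $[0, 2\pi]$ and the midpoint rule below are illustrative; here each of the three integrals equals $\pi$, so the squared norm is $3\pi$:

```python
import numpy as np

# Squared Sobolev S^2 norm of f(x) = sin(x) on [0, 2*pi], approximated by
# midpoint-rule quadrature of the value, gradient, and curvature terms.
# Analytically each integral equals pi, so ||f||_{S^2}^2 = 3*pi.
a, b, n = 0.0, 2 * np.pi, 1_000_000
x = a + (np.arange(n) + 0.5) * (b - a) / n   # midpoint nodes
dx = (b - a) / n

f   = np.sin(x)        # f
df  = np.cos(x)        # f'
d2f = -np.sin(x)       # f''

sobolev_sq = np.sum((f**2 + df**2 + d2f**2) * dx)
print(sobolev_sq, 3 * np.pi)
```

Because the integrand is smooth and periodic over a full period, the midpoint rule converges far faster here than its generic second-order rate.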

In the context of training neural networks $m(x; \theta)$, the empirical $k$-th order Sobolev loss is formulated as

$$L_{S,k}(\theta) := \sum_{i=1}^N \bigg[ \ell_0(m(x_i), f(x_i)) + \sum_{j=1}^k \ell_j\big(D^j m(x_i), D^j f(x_i)\big) \bigg]$$

with $\ell_j$ typically chosen as squared error. For $k = 2$, this specializes to matching values, gradients, and Hessians (Czarnecki et al., 2017).

2. Practical Computation: Hessian Penalties and Efficient Backpropagation

Computing full Hessian penalties explicitly incurs prohibitive $O(n^2)$ cost per output, where $n$ is the input dimension. To manage computational and memory costs, stochastic Sobolev penalties project Hessians along random unit directions $v \in S^{n-1}$, using

$$\mathbb{E}_v \,\| H_x m(x_i)\, v - H_x f(x_i)\, v \|^2$$

This is estimated using automatic differentiation frameworks via two nested gradients: one to compute the Jacobian $G_i = \nabla_x m(x_i)$, and a second on the projected gradient $\langle G_i, v \rangle$ to obtain the Hessian-vector product. This approach preserves scalability for large-scale models, resulting in an overall memory and time cost increase by a factor of roughly 1.5–2× compared to standard backpropagation (Czarnecki et al., 2017).
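The projection is unbiased up to a known scale: for $v$ uniform on the unit sphere, $\mathbb{E}[v v^\top] = I/n$, so the expected projected penalty equals the full Frobenius penalty divided by $n$. A minimal Monte Carlo check, using arbitrary symmetric matrices as stand-ins for the model and target Hessians at one input:

```python
import numpy as np

# For v uniform on the unit sphere S^{n-1}, E[v v^T] = I/n, hence
# E_v ||(H_m - H_f) v||^2 = ||H_m - H_f||_F^2 / n.
# H_m and H_f are illustrative symmetric matrices, not real network Hessians.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); H_m = A + A.T   # "model" Hessian
B = rng.standard_normal((n, n)); H_f = B + B.T   # "target" Hessian
D = H_m - H_f

num_samples = 200_000
g = rng.standard_normal((num_samples, n))
v = g / np.linalg.norm(g, axis=1, keepdims=True) # uniform unit directions
est = np.mean(np.sum((v @ D.T) ** 2, axis=1))    # Monte Carlo E_v ||D v||^2

exact = np.linalg.norm(D, "fro") ** 2 / n
print(est, exact)
```

In training, $D v$ itself would come from two Hessian-vector products via nested autodiff rather than from explicit matrices; the scaling argument is unchanged.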

3. Canonical Objective: Second-order Sobolev Loss

The explicit form of the second-order Sobolev objective is

$$L_{S,2}(\theta) = \sum_{i=1}^N \Big[ \|m(x_i; \theta) - f(x_i)\|^2 + \lambda_1 \|\nabla_x m(x_i; \theta) - \nabla_x f(x_i)\|^2 + \lambda_2 \|H_x m(x_i; \theta) - H_x f(x_i)\|_F^2 \Big]$$

where $\lambda_1, \lambda_2 \geq 0$ are trade-off parameters and the three terms penalize errors in the function value, gradient, and Hessian, respectively.
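A minimal numerical sketch of this objective, assuming a toy setting in which both the target $f$ and the model $m$ are quadratic forms, so that values, gradients, and Hessians are available in closed form ($Q$, $P$, $\lambda_1$, $\lambda_2$ are illustrative choices):

```python
import numpy as np

# L_{S,2} for quadratic maps: f(x) = x^T Q x has gradient 2 Q x and Hessian 2 Q,
# and likewise for the "model" m(x) = x^T P x. All constants are illustrative.
rng = np.random.default_rng(1)
n, N = 3, 10
Q = np.diag([1.0, 2.0, 3.0])      # "target" quadratic form
P = Q + 0.1 * np.eye(n)           # slightly perturbed "model"
lam1, lam2 = 0.1, 0.01            # trade-off weights lambda_1, lambda_2
X = rng.standard_normal((N, n))   # sample inputs x_i

loss = 0.0
for x in X:
    val_err  = (x @ P @ x - x @ Q @ x) ** 2                 # value term
    grad_err = np.sum((2 * P @ x - 2 * Q @ x) ** 2)         # gradient term
    hess_err = np.linalg.norm(2 * P - 2 * Q, "fro") ** 2    # Hessian term
    loss += val_err + lam1 * grad_err + lam2 * hess_err
print(loss)
```

Setting $P = Q$ makes all three terms vanish, confirming that the objective is minimized exactly when values and both derivative orders match.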

In engineering applications such as model order reduction for finite element simulations, the losses may be defined as half-mean-squared-errors (hMSE) on energy, force, and stiffness:

$$\mathcal{L}(\theta) = w_E\,\mathcal{L}_E + w_F\,\mathcal{L}_F + w_K\,\mathcal{L}_K$$
$$\mathcal{L}_E = \frac{1}{2 N_S} \sum_i (\hat e_i - e_i)^2, \qquad \mathcal{L}_F = \frac{1}{2N_S} \sum_i \|\hat f_{r,i} - f_{r,i}\|^2, \qquad \mathcal{L}_K = \frac{1}{2N_S} \sum_i \|\hat K_{r,i} - K_{r,i}\|_F^2$$

with $w_E, w_F, w_K$ as static or dynamic weights (Schütz et al., 15 Jan 2026).
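These hMSE terms can be sketched directly on snapshot arrays. The data below are random stand-ins for energy, reduced-force, and reduced-stiffness snapshots; the static weights reuse the values from the paper's best-performing static variant:

```python
import numpy as np

# Half-mean-squared-error (hMSE) terms on synthetic snapshot data:
# e_hat/e are energies, f_hat/f reduced forces, K_hat/K reduced stiffness
# matrices. All arrays are random stand-ins, not real FEM data.
rng = np.random.default_rng(2)
N_S, r = 8, 4                                  # snapshots, reduced dimension
e_hat, e = rng.standard_normal(N_S), rng.standard_normal(N_S)
f_hat, f = rng.standard_normal((N_S, r)), rng.standard_normal((N_S, r))
K_hat, K = rng.standard_normal((N_S, r, r)), rng.standard_normal((N_S, r, r))

L_E = np.sum((e_hat - e) ** 2) / (2 * N_S)
L_F = np.sum((f_hat - f) ** 2) / (2 * N_S)
L_K = np.sum((K_hat - K) ** 2) / (2 * N_S)     # squared Frobenius norms

w_E, w_F, w_K = 1.0, 9e2, 9e-8                 # static weights from the paper
total = w_E * L_E + w_F * L_F + w_K * L_K
print(L_E, L_F, L_K, total)
```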

4. Empirical Performance and Applications

Czarnecki et al. demonstrate that first-order Sobolev training ($k=1$) yields significant improvements in data efficiency and generalization for regression, policy distillation, and synthetic-gradient tasks, especially in low-data regimes. Theoretically, ReLU networks are universal approximators in Sobolev spaces, i.e., for any $C^k$ target $f$, one can achieve arbitrary accuracy in the $\|\cdot\|_{S^k}$ norm. Sample complexity is provably reduced: for simple function classes, Sobolev supervision needs fewer samples than pure value matching ($K_{sob} < K_{reg}$) (Czarnecki et al., 2017).

In model reduction for nonlinear finite element simulation, Schütz et al. applied second-order Sobolev training to a physics-augmented neural network (PANN) for a static, geometrically nonlinear cantilever beam test. Despite the theoretical expectation that including stiffness (Hessian) information would improve data efficiency, generalization, and the physical fidelity of derivatives, none of 22 tested variants of the second-order Sobolev loss outperformed the baseline ("force-only" training) in force-matching error. Even the best variant, using max-based static weighting with Hessian ramp-in ($w_E, w_F, w_K = 1,\ 9 \times 10^2,\ 9 \times 10^{-8}$), achieved a worse validation loss than the baseline ($\sim 8 \times 10^{-6}\ \mathrm{N}$ vs. $\sim 3.3 \times 10^{-6}\ \mathrm{N}$). Dynamic weighting schemes and other loss-balancing approaches also failed to improve upon the baseline (Schütz et al., 15 Jan 2026).

5. Optimization and Architectural Considerations

No architectural changes are strictly required for Sobolev training. For outputting energy and its derivatives, an input-convex neural network (ICNN) is combined with correction terms to ensure zero value and gradient at the origin:

$$\hat e(x_r) = \mathrm{ICNN}(x_r) - \nabla\mathrm{ICNN}(0) \cdot x_r - \mathrm{ICNN}(0)$$

All derivatives, $\hat f = \nabla \hat e$ and $\hat K = \nabla^2 \hat e$, are computed via automatic differentiation. Loss scaling is critical: the three terms (energy, force, and stiffness) can differ by several orders of magnitude, requiring experimentation with static and dynamic weightings, loss ramping, or multi-objective gradient aggregation (e.g., using the “Jacobian-descent” package) (Schütz et al., 15 Jan 2026).
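The effect of the correction terms can be checked numerically. The sketch below substitutes a minimal convex surrogate for the ICNN (a sum of softplus units with an analytic gradient); subtracting the affine part $\nabla c(0) \cdot x + c(0)$ leaves convexity intact while forcing zero energy and zero gradient at the origin:

```python
import numpy as np

# Convex stand-in for an ICNN: c(x) = sum(softplus(W x + b)), with gradient
# W^T sigmoid(W x + b). Weights are random illustrative values.
rng = np.random.default_rng(3)
r, h = 3, 5
W = rng.standard_normal((h, r))
b = rng.standard_normal(h)

softplus = lambda z: np.logaddexp(0.0, z)      # stable log(1 + e^z)
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))

def c(x):      return np.sum(softplus(W @ x + b))
def grad_c(x): return W.T @ sigmoid(W @ x + b)

# Correction terms: subtract the value and the affine (gradient) part at 0.
c0, g0 = c(np.zeros(r)), grad_c(np.zeros(r))
def e_hat(x):    return c(x) - g0 @ x - c0     # corrected energy
def grad_e(x):   return grad_c(x) - g0         # corrected gradient

x0 = np.zeros(r)
print(e_hat(x0), np.linalg.norm(grad_e(x0)))   # both zero by construction
```

Since only an affine function is subtracted, the corrected $\hat e$ remains convex whenever the underlying network is.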

Standard practice includes input pre-standardization by a frozen linear layer to ensure each reduced coordinate has zero mean and unit variance, without impacting convexity guarantees.
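A minimal sketch of such a frozen standardization step, assuming statistics computed once from training snapshots; since $z = (x - \mu)/\sigma$ is affine in $x$, composing it with a convex network preserves convexity:

```python
import numpy as np

# Frozen input standardization: mean/std are computed once from training data
# and then held fixed as an affine pre-layer. The data here are synthetic.
rng = np.random.default_rng(4)
X_train = 5.0 + 2.0 * rng.standard_normal((1000, 3))

mu    = X_train.mean(axis=0)                   # frozen shift
sigma = X_train.std(axis=0)                    # frozen scale

standardize = lambda X: (X - mu) / sigma       # the frozen affine layer

Z = standardize(X_train)
print(Z.mean(axis=0), Z.std(axis=0))           # ~0 and ~1 per coordinate
```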

6. Limitations, Extrapolation, and Observed Challenges

Although second-order Sobolev training offers theoretical advantages, empirical results by Schütz et al. indicate practical limitations in hyperreduction for nonlinear FEM. The Hessian-loss term was found to be numerically “inert” and challenging to fit, dominating the objective unless carefully reweighted, and did not align with the primary force-matching objective. Both force-only and Sobolev-trained PANN models exhibited catastrophic divergence during extrapolation (e.g., reversed loading), with Newton-Raphson solvers failing to converge outside the training range. The trajectory piecewise-linear (TPWL) baseline, while less accurate in force, remained robust across test cases.

The intended gains in interpolation accuracy and extrapolation stability from enforcing correct derivatives were not realized in these finite-element benchmarks. Including higher-order derivative terms increased training time and hyperparameter-tuning complexity without yielding superior performance compared to first-order approaches (Schütz et al., 15 Jan 2026).

7. Summary Table: Loss Structure and Computational Aspects

| Loss Term | Definition | Computational Aspect |
| --- | --- | --- |
| Function value | $\lVert m(x_i)-f(x_i)\rVert^2$ | Standard forward and backward pass |
| Jacobian (1st derivative) | $\lVert \nabla_x m(x_i)-\nabla_x f(x_i)\rVert^2$ | Single additional gradient (autodiff) |
| Hessian (2nd derivative) | $\lVert H_x m(x_i)-H_x f(x_i)\rVert_F^2$ or stochastic projection | Second gradient / Hessian-vector product; $O(n^2)$ for the full Hessian |

In summary, second-order Sobolev training generalizes the classical objective by penalizing discrepancies up to the Hessian level but entails increased computational burden, sensitivity to loss weighting, and in practice may not always improve or robustify function approximation in complex settings, despite its strong theoretical motivation (Czarnecki et al., 2017, Schütz et al., 15 Jan 2026).
