
Nonlinear Approximation Characteristics

Updated 10 January 2026
  • Nonlinear approximation is a method for approximating functions using adaptive, data-driven selections from flexible dictionaries that yield superior accuracy compared to linear methods.
  • It leverages compositions and deep architectures, such as ReLU networks, to achieve exponential or rate-doubled convergence under certain smoothness assumptions.
  • Applications include efficient high-dimensional PDE solvers, stable data assimilation, and compressive sensing where stability and optimal error decay are crucial.

Nonlinear approximation characteristics encompass the theory, methodologies, and performance metrics associated with approximating functions or operators via mappings that are nonlinear in their parameters, selection, or construction. In contrast to classical (linear) approximation, which utilizes fixed subspaces or bases, nonlinear approximation exploits the flexibility of adaptive or data-driven choices, compositions, and other nonlinear mechanisms to achieve superior accuracy for a wider range of target functions and under broader model assumptions.

1. Definitions and Metrics in Nonlinear Approximation

A central framework is best $N$-term nonlinear approximation, where, for a function $f$ and a "dictionary" $\mathcal{D}$ of admissible functions, the goal is to approximate $f$ as a linear combination of $N$ elements optimally chosen from $\mathcal{D}$:

$$\varepsilon_{L,f}(N) := \min_{\{g_n\}\subset\mathbb{R},\ \{T_n\}\subset\mathcal{D}} \left\| f(x) - \sum_{n=1}^{N} g_n T_n(x) \right\|$$

The metric of interest is how $\varepsilon_{L,f}(N)$ decays as $N$ increases, reflecting the approximation efficiency. Typical choices of $\mathcal{D}$ include bases or frames (e.g., wavelets, splines, kernels, neural network parameterizations), and the "nonlinear" aspect refers to optimizing the selection of the $N$ terms for each target $f$, rather than committing to a fixed collection (Shen et al., 2019).
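A minimal sketch of this optimization in a discrete setting: orthogonal matching pursuit greedily selects $N$ atoms from a finite sampled dictionary and refits the coefficients $g_n$ at each step. This is an illustrative stand-in for the exact minimizer (which is generally intractable); the shifted-Gaussian dictionary below is a made-up example.

```python
import numpy as np

def omp_n_term(f_vals, dictionary, n_terms):
    """Greedy (orthogonal matching pursuit) N-term approximation of a
    sampled function from a finite dictionary of sampled atoms.

    f_vals     : (m,) samples of the target f on a grid
    dictionary : (m, K) columns are sampled dictionary elements T_k
    n_terms    : number N of atoms to select
    Returns the selected column indices and the coefficients g_n.
    """
    residual = f_vals.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(n_terms):
        # pick the atom most correlated with the current residual
        scores = np.abs(dictionary.T @ residual)
        scores[selected] = -np.inf            # don't reselect an atom
        selected.append(int(np.argmax(scores)))
        # re-fit all coefficients jointly on the selected atoms
        A = dictionary[:, selected]
        coeffs, *_ = np.linalg.lstsq(A, f_vals, rcond=None)
        residual = f_vals - A @ coeffs
    return selected, coeffs

# toy example: a two-atom target over a shifted-Gaussian dictionary
x = np.linspace(0, 1, 200)
atoms = np.stack([np.exp(-((x - c) / 0.05) ** 2)
                  for c in np.linspace(0, 1, 50)], axis=1)
f = 2.0 * atoms[:, 10] - 1.5 * atoms[:, 37]
idx, g = omp_n_term(f, atoms, n_terms=2)
```

Because the two atoms barely overlap, the greedy selection recovers the exact support and coefficients here; for coherent dictionaries greedy selection is only near-optimal.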

For multivariate and tensor-product settings, similar constructs arise (e.g., best $N$-term tensor product approximations (Bazarkhanov et al., 2014)).

2. Fundamental Theorem Classes and Rate Results

2.1 Classical and Kernel-Based Best $N$-Term Rates

For many canonical function spaces (e.g., Sobolev, Triebel–Lizorkin, Besov), best $N$-term approximation with suitably regular kernel families, wavelets, or splines satisfies (Hamm et al., 2016):

$$\varepsilon_f(N) \lesssim N^{-s/d}$$

where $\varepsilon_f(N)$ is the error of the best approximation by $N$ kernel terms, $s$ is the smoothness parameter, and $d$ the ambient dimension.
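This kind of decay can be observed numerically with a Haar wavelet dictionary: keeping the $N$ largest-magnitude coefficients (the best $N$-term choice in $L_2$) adapts automatically to a singularity. This is an illustrative sketch, not the kernel construction of the cited work.

```python
import numpy as np

def haar_analysis(signal):
    """Full orthonormal Haar wavelet transform of a length-2^J array."""
    coeffs = []
    approx = signal.astype(float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2))   # detail coefficients
        approx = (even + odd) / np.sqrt(2)         # coarser approximation
    coeffs.append(approx)
    return coeffs

def haar_synthesis(coeffs):
    """Invert haar_analysis."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + detail) / np.sqrt(2)
        out[1::2] = (approx - detail) / np.sqrt(2)
        approx = out
    return approx

def best_n_term(signal, n):
    """Keep the n largest-magnitude Haar coefficients, zero the rest."""
    coeffs = haar_analysis(signal)
    flat = np.concatenate(coeffs)
    keep = np.argsort(np.abs(flat))[-n:]
    mask = np.zeros_like(flat)
    mask[keep] = 1.0
    flat = flat * mask
    out, pos = [], 0                 # unflatten back into the pyramid
    for c in coeffs:
        out.append(flat[pos:pos + len(c)])
        pos += len(c)
    return haar_synthesis(out)

x = np.linspace(0, 1, 1024, endpoint=False)
f = np.sqrt(np.abs(x - 0.5))         # kink at x = 0.5
err64 = np.max(np.abs(f - best_n_term(f, 64)))
err256 = np.max(np.abs(f - best_n_term(f, 256)))
```

The retained coefficients cluster near the kink, which is exactly the adaptivity that a fixed (linear) truncation cannot provide.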

2.2 Composition and Deep Learning Regimes

When dictionary elements are compositions (notably neural networks built from $L$ composed blocks), depth can dramatically accelerate best $N$-term rates:

  • Depth-1 (shallow): error $O(N^{-\alpha/d})$ for Hölder functions of order $\alpha$ on $[0,1]^d$
  • Depth-2: $O(N^{-2\alpha/d})$; the exponent doubles
  • Depth-3 and higher: the doubled rate $O(N^{-2\alpha/d})$ is already optimal, so extra layers give no further gain (Shen et al., 2019).

For one-dimensional target functions with only continuity (not smoothness), the doubling of the rate at depth 2 still occurs.
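The mechanism behind depth-induced gains can be seen in the classic tent-map example (in the spirit of, though not identical to, the cited construction): composing a two-piece, ReLU-representable hat function $k$ times produces $2^k$ linear pieces from only $O(k)$ parameters, a combinatorial gain no shallow network of comparable size can match.

```python
import numpy as np

def hat(x):
    # tent map on [0,1]: 2x for x <= 1/2, 2 - 2x otherwise;
    # expressible with two ReLU units
    return 2 * np.minimum(x, 1 - x)

def composed_hat(x, k):
    """k-fold composition: 2^k linear pieces from O(k) parameters."""
    y = np.asarray(x, dtype=float)
    for _ in range(k):
        y = hat(y)
    return y

x = np.linspace(0, 1, 4097)          # dyadic grid hits all breakpoints
y = composed_hat(x, 5)

# count linear pieces via sign changes of the discrete slope
slopes = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(np.diff(slopes))
```

With $k = 5$ compositions the graph has $2^5 = 32$ linear pieces, i.e. the piece count grows exponentially in depth while the parameter count grows only linearly.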

2.3 Superiority of Deep ReLU Networks

Certain function classes exhibit exponential (Shannon-type) rates for deep ReLU networks; Takagi-type and self-similar functions can be approximated with error decaying exponentially in the number of network parameters, while best spline or wavelet dictionaries deteriorate to polynomial rates (Daubechies et al., 2019).

2.4 Nonlinear Tensor Product Approximation

For functions in a periodic mixed-smoothness class in $d$ variables, the best $m$-term tensor product approximation error decays polynomially in $m$, with an exponent governed by the mixed smoothness and the dimension. Upper and lower bounds match up to logarithmic factors; constructive greedy algorithms attain similar exponents (Bazarkhanov et al., 2014).
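For $d = 2$ the discrete analogue is transparent: by the Eckart–Young theorem, the best rank-$m$ ($m$-term tensor product) approximation of a sampled bivariate function in the Frobenius ($L_2$) sense is the truncated SVD. A sketch, with a made-up Cauchy-type test function whose singular values decay rapidly:

```python
import numpy as np

# sample a smooth, full-rank bivariate function on a grid
x = np.linspace(0, 1, 128)
F = 1.0 / (1.0 + np.add.outer(x, x))   # f(s, t) = 1 / (1 + s + t)

U, s, Vt = np.linalg.svd(F, full_matrices=False)

def rank_m(m):
    """Best rank-m (m-term tensor product) approximation, Eckart-Young."""
    return (U[:, :m] * s[:m]) @ Vt[:m]

# Frobenius errors drop rapidly as the number of tensor terms grows
errs = [np.linalg.norm(F - rank_m(m)) for m in (1, 2, 4, 8)]
```

The fast error decay here reflects the analyticity of the sampled function; for genuinely rough functions the decay exponent is dictated by the mixed smoothness, as in the continuous theory above.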

2.5 Restricted and Weighted Approximation

The introduction of general measures $\nu$ for restricted $N$-term approximation leads to a full characterization of approximation spaces in terms of weighted Lorentz sequence spaces and the upper/lower Temlyakov property, unifying Jackson and Bernstein inequalities with approximation embeddings (Hernández et al., 2011).

3. Methodological Frameworks

3.1 Compositional Dictionaries

Composition of shallow networks or blocks leads to improved expressivity. The optimal rate-doubling ("L=2") and rate saturation ("L=3") phenomena are both tied to the combinatorial growth of the function landscape under composition and tiling: e.g., in $d$ dimensions, tiling the domain requires on the order of $N^d$ cubes of side $1/N$, each attaining $O(N^{-\alpha})$ local error for Hölder functions of order $\alpha$ (Shen et al., 2019).

3.2 Kernel and Approximation Families

Regular families of decaying or growing kernels—encompassing Gaussians, multiquadrics, cardinal functions—can be systematically analyzed for their $N$-term rates by verifying translation, dilation, and Poisson summation properties (hypotheses (A1)–(A6)) (Hamm et al., 2016). These kernels enable nonlinear spaces that match the performance of best wavelet expansions.

3.3 Greedy and Library-Based Schemes

For high-dimensional parametric PDEs and analytic function classes with anisotropy, adaptive library-based piecewise Taylor approximations subdivide the parameter space and select local low-dimensional spaces for each cell. The resulting complexity depends only logarithmically or subexponentially on the error tolerance, breaking or mitigating the curse of dimensionality according to the anisotropy decay (Guignard et al., 2022).
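A one-dimensional caricature of the library idea, with local least-squares polynomial fits standing in for Taylor expansions and interval bisection standing in for adaptive parameter-space subdivision (all concrete choices here are illustrative, not those of the cited work):

```python
import numpy as np

def adaptive_taylor(f, a, b, tol, degree=2, max_depth=12):
    """Recursively bisect [a, b]; keep a local centered polynomial
    ("Taylor-like") fit on each cell once it meets the tolerance."""
    xs = np.linspace(a, b, 17)
    mid_pt = 0.5 * (a + b)
    c = np.polyfit(xs - mid_pt, f(xs), degree)    # centered local fit
    err = np.max(np.abs(np.polyval(c, xs - mid_pt) - f(xs)))
    if err <= tol or max_depth == 0:
        return [(a, b, c)]
    return (adaptive_taylor(f, a, mid_pt, tol, degree, max_depth - 1)
            + adaptive_taylor(f, mid_pt, b, tol, degree, max_depth - 1))

def evaluate(cells, x):
    """Evaluate the piecewise surrogate at a point x."""
    for a, b, c in cells:
        if a <= x <= b:
            return np.polyval(c, x - 0.5 * (a + b))
    raise ValueError("x outside partition")

f = lambda x: np.sin(1.0 / (x + 0.2))   # sharper variation near x = 0
cells = adaptive_taylor(f, 0.0, 1.0, tol=1e-3)
widths = [b - a for a, b, _ in cells]
approx_val = evaluate(cells, 0.9)
```

The partition refines automatically where the target varies fastest (near $x = 0$), so the cell widths grow from left to right; this adaptivity to local regularity is the essence of the library approach.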

3.4 Choquet and Nonlinear Integral Operators

Nonlinear extension of classical constructive schemes via the Choquet integral leads to Bernstein–Choquet and Picard–Choquet operators. These exhibit improved rates for certain function classes (monotone/concave or exponentials), outperforming classical linear positive operators in those regimes (Gal, 2014).
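For intuition, the discrete Choquet integral with respect to a capacity (a monotone set function) replaces the linear weighted sum; already a convex distortion of the uniform measure yields a genuinely nonlinear functional. A minimal sketch of the standard discrete formula (not the cited Bernstein–Choquet operators themselves):

```python
import numpy as np

def choquet_integral(values, capacity):
    """Discrete Choquet integral of a nonnegative vector w.r.t. a capacity.

    capacity(S) maps a frozenset of indices to [0, 1], monotone, with
    capacity(empty set) = 0 and capacity(all indices) = 1.
    """
    v = np.asarray(values, dtype=float)
    order = np.argsort(v)                      # sort values increasingly
    total, prev = 0.0, 0.0
    for k in range(len(v)):
        # level set: indices whose value is >= the k-th smallest value
        upper = frozenset(order[k:].tolist())
        total += (v[order[k]] - prev) * capacity(upper)
        prev = v[order[k]]
    return total

# convex distortion of the uniform measure: mu(S) = (|S|/n)^2
n = 4
cap = lambda S: (len(S) / n) ** 2
vals = [1.0, 3.0, 2.0, 4.0]
ch = choquet_integral(vals, cap)
```

With the undistorted uniform capacity $|S|/n$ the Choquet integral reduces to the arithmetic mean (2.5 here); the convex distortion above shrinks the value to 1.875, illustrating the nonlinearity.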

3.5 Algorithmic and Computational Aspects

Many nonlinear approximation methods, particularly those involving nonconvex selection or parameter search (e.g., kernel parameter grids, nonnegative least squares for rational/exponential approximations (Vabishchevich, 2023)), employ iterative or greedy algorithms. Effective discretization, active-set NNLS, and QR-based stabilization are standard techniques; provable convergence properties may be lacking in fully nonlinear parameter regimes.
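A sketch of the grid-plus-NNLS idea for sums of exponentials: fix a grid of candidate decay rates, then solve the nonnegative least-squares problem for the weights. For simplicity this uses accelerated projected gradient rather than the active-set NNLS mentioned above; the target $1/(1+t)$ and the rate grid are made-up illustrative choices.

```python
import numpy as np

def nnls_apgd(A, b, iters=5000):
    """Nonnegative least squares min ||Ax - b||, x >= 0, solved by
    accelerated projected gradient (FISTA) -- a simple stand-in for
    an active-set NNLS solver."""
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        x_new = np.maximum(0.0, z - (A.T @ (A @ z - b)) / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# approximate f(t) = 1/(1+t) on [0, 5] by a nonnegative exponential sum
t = np.linspace(0, 5, 200)
rates = np.linspace(0.05, 5.0, 40)      # fixed grid of decay rates
A = np.exp(-np.outer(t, rates))         # columns: e^{-r t} atoms
b = 1.0 / (1.0 + t)
x = nnls_apgd(A, b)
resid = np.max(np.abs(A @ x - b))
```

The target is completely monotone ($1/(1+t) = \int_0^\infty e^{-st}e^{-s}\,ds$), so a nonnegative combination over a rate grid fits it well; the nonnegativity constraint is what keeps the fit stable despite the severe ill-conditioning of the exponential dictionary.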

4. Stability, Manifold Widths, and Optimality

Realistic nonlinear approximation must account for numerical stability, most prominently captured by the notion of stable manifold widths $\delta^*_n(K)_X$. These widths are intimately connected to the entropy numbers $\varepsilon_n(K)$ measuring the compactness of the model class $K$. Fundamental consequences (Cohen et al., 2020):

  • For Hilbert spaces, stable widths and entropy numbers are equivalent up to constants.
  • In Banach spaces, enforcing $\gamma$-Lipschitz continuity of the encoder/decoder pair bounds the achievable approximation rates by the entropy numbers—precluding "faster-than-entropy" rates.
  • For unit Lipschitz-bounded function classes (e.g., the unit ball of Lip 1 on $[0,1]$), enforcing stability forces $O(n^{-1})$ error decay, even though unstable approximations (deep ReLU nets with arbitrary parameterization) can obtain $O(n^{-2})$.

5. Specialized and Emerging Regimes

5.1 Quadratic and Algebraic Manifolds

The quadratic formula–based degree-2 nonlinear approximation constructs closed-form smooth coefficient manifolds to represent single-variable functions as roots of degree-2 polynomials with a learned index function for branch selection. This yields global exponential convergence across discontinuities (unlike linear/rational schemes), as the algebraic variety encodes jumps sharply and enables effective edge-preserving denoising (He et al., 6 Dec 2025).
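The algebraic core of this idea can be sketched directly: a function that jumps between two smooth branches $f_1, f_2$ is everywhere a root of $y^2 - (f_1+f_2)\,y + f_1 f_2 = 0$, whose coefficients are globally smooth; a one-bit index function selects the branch. (Illustrative only; the cited method learns the coefficient manifold and index function rather than assuming known branches.)

```python
import numpy as np

# two smooth branches and a target that jumps between them at x = 0.5
x = np.linspace(0, 1, 400)
f1 = np.sin(2 * np.pi * x)
f2 = np.cos(2 * np.pi * x) + 2.0          # stays strictly above f1
target = np.where(x < 0.5, f1, f2)        # discontinuous at x = 0.5

# quadratic coefficients are SMOOTH even though the target jumps:
# y^2 - c1*y + c0 has roots exactly f1 and f2
c1 = f1 + f2                              # smooth sum of branches
c0 = f1 * f2                              # smooth product of branches

# recover the branches with the quadratic formula + 1-bit index function
disc = np.sqrt(np.maximum(c1**2 - 4 * c0, 0.0))   # = |f2 - f1|
root_plus = 0.5 * (c1 + disc)             # upper branch (= f2 here)
root_minus = 0.5 * (c1 - disc)            # lower branch (= f1 here)
index = (x >= 0.5)                        # which branch to take where
recovered = np.where(index, root_plus, root_minus)

jump_size = np.max(np.abs(np.diff(target)))       # ~1: genuine jump
coeff_step = np.max(np.abs(np.diff(c1)))          # small: smooth coeffs
```

The jump lives entirely in the one-bit index; the coefficient functions $c_1, c_0$ are smooth, which is why smooth approximants of the coefficients can represent the discontinuity sharply.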

5.2 Piecewise-Affine and Cut-Based Schemes

Multi-dimensional nonlinear functions can be efficiently approximated by iteratively partitioning the domain using hinging hyperplanes and fitting local affine surrogates (PWA). Adaptive cut-selection, continuity enforcement, and region complexity increase only as needed, attaining accuracy with many fewer regions compared to mesh-recursive baselines (Gharavi et al., 2024).
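A simplified sketch of the partitioning loop, with axis-aligned median cuts standing in for general hinge selection and a refine-only-where-needed rule (all concrete choices here are illustrative):

```python
import numpy as np

def fit_affine(X, y):
    """Least-squares affine fit y ~ X @ w + b.
    Returns the stacked coefficients [w; b] and the max residual."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = np.max(np.abs(A @ coef - y))
    return coef, resid

def pwa_partition(X, y, tol, depth=8):
    """Recursively split along the widest axis until each region's
    affine surrogate meets the tolerance (an axis-aligned
    simplification of adaptive hinge-cut selection)."""
    coef, resid = fit_affine(X, y)
    if resid <= tol or depth == 0 or len(X) < 8:
        return [coef]
    axis = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
    cut = np.median(X[:, axis])
    left = X[:, axis] <= cut
    return (pwa_partition(X[left], y[left], tol, depth - 1)
            + pwa_partition(X[~left], y[~left], tol, depth - 1))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.abs(X[:, 0]) + 0.5 * X[:, 1] ** 2      # nonlinear target
regions = pwa_partition(X, y, tol=0.05)
```

Region complexity grows only where the target is genuinely nonlinear; general hinge cuts (rather than axis-aligned ones) reduce the region count further by aligning cuts with the target's ridges.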

5.3 Recurrent and Sequence Models

Nonlinear RNN approximation is fundamentally limited by a Bernstein-type inverse theorem: stably approximable sequence-to-sequence maps must have exponentially decaying memory kernels, generalizing the "curse of memory" from linear to nonlinear architectures. Overcoming this requires Hurwitz-parameterized recurrent matrices to stably represent slow memory decay (Wang et al., 2023).
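The linear case of the memory-kernel obstruction is easy to see numerically: for the scalar recurrence $h_{t+1} = \lambda h_t + x_t$ with $|\lambda| < 1$, the impulse response is exactly $\lambda^t$, i.e. memory decays exponentially at rate $\log\lambda$. A toy check (illustrating the linear phenomenon only, not the nonlinear theorem of the cited work):

```python
import numpy as np

def impulse_response(lam, T):
    """Memory kernel of the scalar recurrence h_{t+1} = lam*h_t + x_t:
    the response at time t to a unit input at time 0 is lam**t."""
    h = 0.0
    kernel = []
    for t in range(T):
        x = 1.0 if t == 0 else 0.0
        h = lam * h + x
        kernel.append(h)
    return np.array(kernel)

k = impulse_response(0.9, 60)
# log|kernel| is linear in t: exponential decay with slope log(lam)
slope = np.polyfit(np.arange(60), np.log(k), 1)[0]
```

Any stable approximation with such a recurrence therefore cannot represent targets whose memory decays slower than exponentially, which is the obstruction the Hurwitz reparameterization is designed to relax.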

6. Practical Implications and Applications

  • For Hölder or Sobolev targets on $[0,1]^d$, compositional deep networks with moderate width and depth achieve the best-known nonlinear rates, and extra depth offers no further gain (Shen et al., 2019).
  • In kernel regimes, nonlinear $N$-term kernel spaces achieve wavelet-optimal rates, and cardinal interpolation yields powerful greedy truncated or adaptive sampling-based approximations.
  • Library-based partitioning enables scalable surrogates for high-dimensional parametric models (PDEs, uncertainty quantification), with complexity scaling dictated by analytic anisotropy parameters (Guignard et al., 2022).
  • For models requiring stability (data assimilation, numerical PDEs, compressed sensing), achievable rates must be benchmarked via entropy or stable manifold widths, not by the raw performance of unconstrained parametrizations (Cohen et al., 2020).

7. Open Problems and Outlook

Open questions include the development of optimal or near-minimal algorithms for coefficient construction in nonlinear/algebraic manifold representations, understanding the precise role of Lipschitz stability across architectures, effective index/function encoding in high-dimensional or multi-valued contexts, and rigorous convergence guarantees for adaptive piecewise or greedy parameter selection schemes. The theory continues to evolve with advances in neural and kernel architectures, high-dimensional surrogate modeling, and algorithmic stability under data and parameter perturbations.


