
Minimax Rates for Score Estimation

Updated 17 January 2026
  • The paper establishes minimax rate bounds to quantify the sample complexity of recovering the score function under smoothness and tail constraints.
  • Regularized kernel plug-in and neural network estimators are shown to achieve these rates, balancing bias, variance, and regularization in practice.
  • Extensions to heavy-tailed and log-concave densities reveal key trade-offs and open challenges in nonparametric inference and score-based generative modeling.

Score estimation concerns the nonparametric statistical problem of recovering the score function, i.e., the gradient of the logarithm of the density, from independent samples of an unknown probability distribution. This estimation task is central in the analysis and implementation of modern score-based generative models, such as diffusion models, and is also pivotal in classical nonparametric analysis, semi-parametric efficiency, and shape-constrained inference. The minimax rate quantifies the fundamental sample complexity of score estimation as a function of the smoothness, tail properties, and other structural assumptions on the target density.

1. Formalization of the Score Estimation Problem

Given $n$ i.i.d. samples $X_1, \dots, X_n \sim \rho^*$ from an unknown probability density $\rho^*$ on $\mathbb{R}^d$, the score function is defined as

$$s^*(x) = \nabla \log \rho^*(x).$$

The prevailing risk metric is the mean integrated squared error with respect to the data-generating density:

$$L(\hat s; \rho^*) = \int_{\mathbb{R}^d} \|\hat s(x) - s^*(x)\|^2\, \rho^*(x)\, dx.$$

The minimax risk over a class $F$ of densities is

$$R_n(F) = \inf_{\hat s} \sup_{\rho^* \in F} \mathbb{E}_{X_1^n}\, L(\hat s; \rho^*).$$

For smooth, subgaussian densities with a Lipschitz or Hölder score, $F$ may be defined as

  • $F_{\alpha, L} = \{ \rho : \rho \text{ is } \alpha\text{-subgaussian},\ s^* \text{ is } L\text{-Lipschitz} \}$,
  • $F_{\alpha, \beta, L} = \{ \rho : \rho \text{ is } \alpha\text{-subgaussian},\ s^* \text{ is } \beta\text{-Hölder} \}$ for $\beta \leq 1$, where $\alpha$-subgaussianity means $\mathbb{E}\, e^{\theta^\top (X - \mathbb{E}X)} \leq e^{\alpha^2 \|\theta\|^2 / 2}$ for all $\theta \in \mathbb{R}^d$.
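As a toy illustration of this risk functional (our own example, not drawn from the cited papers), the following Python sketch Monte Carlo-estimates $L(\hat s; \rho^*)$ when $\rho^*$ is the standard normal, whose score is $s^*(x) = -x$, against a deliberately biased estimator:

```python
import random

def true_score(x):
    # Score of the standard normal: (log rho*)'(x) = -x.
    return -x

def shrunk_score(x, c=0.9):
    # A hypothetical, deliberately biased estimator: s_hat(x) = -c x.
    return -c * x

def mc_risk(s_hat, n_mc=100_000, seed=1):
    # Monte Carlo estimate of L(s_hat; rho*) = E_{X ~ rho*} |s_hat(X) - s*(X)|^2.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_mc):
        x = rng.gauss(0.0, 1.0)
        total += (s_hat(x) - true_score(x)) ** 2
    return total / n_mc

# For s_hat(x) = -0.9 x the exact risk is (0.1)^2 * E[X^2] = 0.01,
# so the Monte Carlo estimate should land close to 0.01.
```

The weighting by $\rho^*$ is what distinguishes this loss from a plain $L^2(\mathbb{R}^d)$ norm: errors in low-density regions are discounted, which is also why estimators need regularization there.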

2. Minimax Rate Characterization and Lower Bounds

The minimax rate in the standard setting is sharply characterized as follows:

  • For $L$-Lipschitz (i.e., $\beta = 1$) scores:

$$R_n(F_{\alpha, L}) = \Theta(n^{-2/(d+4)}),$$

up to polylogarithmic factors in $n$ (Wibisono et al., 2024).

  • For $\beta$-Hölder scores ($0 < \beta \leq 1$):

$$R_n(F_{\alpha, \beta, L}) = \Theta(n^{-2\beta/(d+2\beta+2)})$$

These rates are established using Fano-type arguments. For Lipschitz classes, the proof constructs a collection of perturbed Gaussian densities; the pairwise separation of their scores in squared $L^2(\rho)$-distance scales as $\epsilon^2$, while their pairwise Kullback-Leibler divergences scale as $\epsilon^4$. Fano's inequality then forces risk $\gtrsim \epsilon^2$ whenever $n\epsilon^4 \lesssim \log M$, where $M$ is the packing cardinality; calibrating $\epsilon$ against this constraint yields the minimax risk scaling $n^{-2/(d+4)}$ (Wibisono et al., 2024).
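One arithmetic calibration consistent with these scalings is sketched below; the packing size $\log M \asymp \epsilon^{-d}$ is an assumption for illustration (roughly one perturbation per cell of side $\asymp \epsilon$), with the exact count supplied by the construction in the cited paper:

```latex
% Fano calibration sketch: pairwise score separation \epsilon^2,
% pairwise KL divergence \epsilon^4, assumed log-packing \epsilon^{-d}.
\[
  n\,\epsilon^4 \;\asymp\; \log M \;\asymp\; \epsilon^{-d}
  \quad\Longrightarrow\quad
  \epsilon^{d+4} \;\asymp\; n^{-1}
  \quad\Longrightarrow\quad
  \text{risk} \;\gtrsim\; \epsilon^2 \;\asymp\; n^{-2/(d+4)}.
\]
```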

The corresponding upper bounds are achieved by regularized kernel estimators and, in the context of generative modeling, empirical risk minimization over suitable function classes (e.g., neural networks) using denoising score matching objectives (Stéphanovitch et al., 7 Jul 2025).

3. Achievability: Estimators Attaining the Minimax Rate

A prototypical estimator is the regularized score plug-in derived from a Gaussian-kernel smoothed empirical measure $\mu_n$:

$$\rho_{n,h}(x) = (\mu_n * N(0, h^2 I_d))(x),$$

with estimated score

$$\hat s_h(x) = \nabla \log \rho_{n,h}(x),$$

computed with a suitable denominator regularization for numerical stability, e.g.,

$$\hat s_h(x) = \frac{ \int \frac{y - x}{h^2}\, G_h(x - y)\, d\mu_n(y) }{ \max\{ \rho_{n,h}(x), \varepsilon \} },$$

where $G_h$ is the Gaussian density with variance $h^2$, so that the numerator equals $\nabla \rho_{n,h}(x)$ (note $\nabla_x G_h(x-y) = \frac{y-x}{h^2} G_h(x-y)$).
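A minimal one-dimensional sketch of this regularized plug-in estimator (illustrative; the helper names and the sanity-check setup are our own). The numerator is the derivative of the smoothed density, using $\frac{d}{dx} G_h(x-y) = \frac{y-x}{h^2} G_h(x-y)$:

```python
import math
import random

def score_estimate(x, samples, h, eps=1e-12):
    """Regularized Gaussian-kernel plug-in score estimator in 1-D (sketch).

    Numerator: (1/n) sum_i (y_i - x)/h^2 * G_h(x - y_i)  = rho_{n,h}'(x);
    denominator: max(rho_{n,h}(x), eps) for numerical stability.
    """
    n = len(samples)
    c = 1.0 / (math.sqrt(2.0 * math.pi) * h)  # Gaussian kernel normalizer
    num, den = 0.0, 0.0
    for y in samples:
        g = c * math.exp(-((x - y) ** 2) / (2.0 * h * h))
        num += (y - x) / (h * h) * g          # derivative of the kernel
        den += g                              # smoothed density
    return (num / n) / max(den / n, eps)

# Sanity check on N(0,1) data: at fixed bandwidth h the estimator targets
# the score of the smoothed density N(0, 1 + h^2), i.e. x -> -x / (1 + h^2).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(50_000)]
h = 0.3
x0 = 0.5
est = score_estimate(x0, data, h)
target = -x0 / (1.0 + h * h)
```

At fixed $h$ the estimator is consistent for the smoothed score; the bias relative to $s^*$ is then controlled by shrinking $h$ with $n$, as in the bandwidth optimization below.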

The estimation error decomposes into three components:

  1. Variance: $\asymp [n h^{d+2}]^{-1}$ from kernel smoothing and Hellinger convergence.
  2. Bias: $\asymp L^2 h^2 d$ from kernel-induced smoothing.
  3. Regularization: a multiplicative $O(\log n)$ factor owing to denominator truncation.

Optimizing over the bandwidth $h$ by balancing variance and bias gives $h \asymp n^{-1/(d+4)}$ and an overall risk $\asymp n^{-2/(d+4)}$ in the Lipschitz case (Wibisono et al., 2024).
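Written out, with $h$ denoting the kernel bandwidth (kernel variance $h^2$) so that the squared bias scales as $h^2$ and the variance as $(n h^{d+2})^{-1}$ — a standard derivative-estimation accounting, stated here as an assumption, with $L$ and $d$ absorbed into constants — the balance reads:

```latex
\[
  \underbrace{L^2 h^2}_{\text{squared bias}}
  \;\asymp\;
  \underbrace{\frac{1}{n\,h^{d+2}}}_{\text{variance}}
  \quad\Longrightarrow\quad
  h^{d+4} \asymp n^{-1}
  \quad\Longrightarrow\quad
  h \asymp n^{-1/(d+4)},
  \qquad
  \text{risk} \asymp h^2 \asymp n^{-2/(d+4)}.
\]
```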

Alternative approaches using neural network score estimators trained via denoising score matching also match the minimax rate (up to log factors), as established via approximation-theoretic arguments for Hölder classes (Stéphanovitch et al., 7 Jul 2025).

4. Extensions: Effect of Smoothness, Tails, and Shape Constraints

Smoothness: For $\beta$-Hölder smooth ($\beta \leq 1$) scores, the minimax rate transitions to $n^{-2\beta/(d+2\beta+2)}$. If the underlying density is smoother or the score is more regular, faster rates can be achieved, but the above remains sharp when restricting to $\beta \leq 1$ (Wibisono et al., 2024).

Tail Behavior: Heavy-tailed target distributions alter minimax rates. For exponentially decaying tails, the minimax rate and the curse of dimensionality persist as in the subgaussian/light-tailed case. For polynomial tails of order $\gamma$, the minimax risk degrades to $n^{-(\gamma+1)/(d+\gamma+1)}\, t^{-(1+d/2)(\gamma+1)/(d+\gamma+1)}$ at smoothing scale $t$ (Yu et al., 10 Jan 2026). Efficient estimators require adapting both the truncation threshold and the bandwidth to balance tail and bulk errors.
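The explicit $\gamma$ dependence can be read off the $n$-exponent of the stated rate; a small sketch (the helper name is our own):

```python
def poly_tail_exponent(d, gamma):
    # n-exponent of the polynomial-tail minimax rate
    # n^{-(gamma+1)/(d+gamma+1)} at a fixed smoothing scale t.
    return (gamma + 1.0) / (d + gamma + 1.0)

# Heavier tails (smaller gamma) give a smaller exponent, i.e. a slower rate;
# as gamma grows the exponent approaches 1, matching the n^{-1} factor
# of the exponential-tail rate n^{-1} t^{-1-d/2}.
```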

Shape Constraints: For log-concave densities with additional Hölder smoothness of the log-density ($\beta \in [1,2]$), the minimax rate for score estimation is $L^{2/(2\beta+1)} n^{-\beta/(2\beta+1)}$ up to polylogarithmic factors; this is faster than density-based or smoothness-only rates in the regime $\beta < 2$ (Lewis et al., 16 Dec 2025). Controlling the score's tail growth, e.g. via quantile-based bounds, is crucial for separating this regime from the non-estimable one, in which the risk remains bounded away from zero.

5. Implications for Score-based Generative Modeling

The statistical bottleneck in diffusion and SGM algorithms is often the estimation of the time-dependent score function $s^*(t, x)$ along the forward SDE. Precise error propagation typically shows that the Wasserstein or total variation distance between the generative model and the target distribution is upper bounded by an integral of the score estimation error. Consequently, the minimax sample complexity for achieving accuracy $\delta$ is exponential in $d$ for subgaussian or exponentially-tailed targets:

$$n \asymp \delta^{-(d+4)/2}$$

for the canonical $L$-Lipschitz class (Wibisono et al., 2024, Stéphanovitch et al., 7 Jul 2025).
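Inverting the Lipschitz-class risk $n^{-2/(d+4)} = \delta$ makes this dimension dependence concrete (a toy calculation, ignoring constants and log factors; the helper name is our own):

```python
def samples_needed(delta, d):
    # Solve n^{-2/(d+4)} = delta for n:  n = delta^{-(d+4)/2}.
    return delta ** (-(d + 4) / 2.0)

# At target squared-L2(rho) accuracy delta = 0.1, the required sample
# size grows exponentially with the dimension d:
#   d = 1  ->  ~3.2e2
#   d = 6  ->  ~1.0e5
#   d = 16 ->  ~1.0e10
```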

Recent bounds for models trained with denoising score matching (via neural networks) and sampled using SDE or ODE integrators demonstrate that generative models similarly obey the minimax $n^{-(\beta+1)/(2\beta+d)}$ rate in Wasserstein or TV loss (up to log factors) (Stéphanovitch et al., 7 Jul 2025, Zhang et al., 2024). Nearly matching lower bounds are obtained for Sobolev/Hölder densities. For scores of smooth, compactly supported densities, recent analysis shows this rate can be achieved without early stopping or extraneous log terms (Dou et al., 2024).

A summary table of minimax rates for selected settings:

| Target/Score Class | Minimax Score Error Rate | Notes |
|---|---|---|
| Subgaussian, Lipschitz score | $n^{-2/(d+4)}$ | Curse of dimensionality (Wibisono et al., 2024) |
| Subgaussian, $\beta$-Hölder score | $n^{-2\beta/(d+2\beta+2)}$ | $\beta \leq 1$ (Wibisono et al., 2024) |
| Heavy-tailed, exponential decay | $n^{-1} t^{-1-d/2}$ | $t \gtrsim n^{-2/(d+2\beta)}$ (Yu et al., 10 Jan 2026) |
| Heavy-tailed, polynomial decay | $n^{-(\gamma+1)/(d+\gamma+1)}$ (at scale $t$) | Explicit $\gamma$ dependence (Yu et al., 10 Jan 2026) |
| Log-concave, Hölder log-density | $L^{2/(2\beta+1)} n^{-\beta/(2\beta+1)}$ | $\beta \in [1,2]$ (Lewis et al., 16 Dec 2025) |

6. Technical Methods and Lower Bound Techniques

The proofs of both lower and upper bounds consistently exploit:

  • Construction of an explicit, finite class of densities (packings), often as local perturbations of a Gaussian or flat base measure, which maintain the required smoothness and tail conditions.
  • Fano's inequality to relate estimation risk to packing separation and Kullback-Leibler divergence, leveraging a separation in the $L^2(\rho)$ norm of the score function for risk lower bounds (Wibisono et al., 2024, Yu et al., 10 Jan 2026, Dou et al., 2024).
  • For upper bounds, analyses are based on kernel-based plug-in estimators, bias-variance decompositions, regularization schemes for low-density regions, and in neural network settings, covering number bounds and function approximation rates (Stéphanovitch et al., 7 Jul 2025, Lewis et al., 16 Dec 2025).

Specialized tools are applied for log-concave or shape-constrained settings, such as uniform confidence bands for the score built via local kernel smoothing, and inversion to produce adaptive, multiscale estimators (Lewis et al., 16 Dec 2025).

7. Broader Context and Current Open Questions

Minimax results for score estimation sharply extend classical nonparametric theory of density and derivative estimation into a setting tailored to the loss structure of score matching and SGM applications. In the heavy-tailed case, a qualitative dichotomy emerges: exponential tails allow the optimal light-tailed rates up to logarithmic factors, while polynomial tails result in explicit degradation indexed by the tail exponent (Yu et al., 10 Jan 2026). It remains open whether the derived sampling rate under polynomial tails is minimax optimal, particularly for the continuous reverse SDE setting.

Future directions include statistical limits for other forward processes (e.g., non-Gaussian diffusions or Lévy processes), minimax theory for discrete-time SGM algorithms, adaptive procedures for unknown smoothness or tail parameters, and the interaction of score estimation with model misspecification or domain constraints. The understanding of topology dependence in parametric score models for pairwise comparisons, as studied with respect to the Laplacian spectrum in the Bradley–Terry–Luce and Thurstone models, further reveals interaction between structural graph constraints and minimax error rates (Shah et al., 2015).

These developments clarify the fundamental sample efficiency limits for a key statistical primitive underlying contemporary generative modeling methodologies, and provide design principles for estimator construction and experimental design in high-dimensional nonparametric inference.
