
Minimax Rates for Score Estimation

Updated 17 January 2026
  • The paper establishes minimax rate bounds to quantify the sample complexity of recovering the score function under smoothness and tail constraints.
  • Regularized kernel plug-in and neural network estimators are shown to achieve these rates, balancing bias, variance, and regularization in practice.
  • Extensions to heavy-tailed and log-concave densities reveal key trade-offs and open challenges in nonparametric inference and score-based generative modeling.

Score estimation concerns the nonparametric statistical problem of recovering the score function, i.e., the gradient of the logarithm of the density, from independent samples of an unknown probability distribution. This estimation task is central in the analysis and implementation of modern score-based generative models, such as diffusion models, and is also pivotal in classical nonparametric analysis, semi-parametric efficiency, and shape-constrained inference. The minimax rate quantifies the fundamental sample complexity of score estimation as a function of the smoothness, tail properties, and other structural assumptions on the target density.

1. Formalization of the Score Estimation Problem

Given $n$ i.i.d. samples $X_1, \dots, X_n \sim \rho^*$ from an unknown probability density $\rho^*$ on $\mathbb{R}^d$, the score function is defined as

$$s^*(x) = \nabla \log \rho^*(x).$$

The prevailing risk metric is the mean integrated squared error with respect to the data-generating density:

$$L(\hat s; \rho^*) = \int_{\mathbb{R}^d} \|\hat s(x) - s^*(x)\|^2\, \rho^*(x)\, dx.$$

The minimax risk over a class $F$ of densities is

$$R_n(F) = \inf_{\hat s} \sup_{\rho^* \in F} \mathbb{E}_{X_1^n}\, L(\hat s; \rho^*).$$

For smooth, subgaussian densities with a Lipschitz or Hölder score, $F$ may be defined as

  • $F_{\alpha, L} = \{ \rho : \rho \text{ is } \alpha\text{-subgaussian},\ s^* \text{ is } L\text{-Lipschitz} \}$,
  • $F_{\alpha, \beta, L} = \{ \rho : \rho \text{ is } \alpha\text{-subgaussian},\ s^* \text{ is } \beta\text{-Hölder} \}$ for $\beta \leq 1$, where $\alpha$-subgaussianity means $\mathbb{E}\, e^{\theta^\top (X - \mathbb{E}X)} \leq e^{\alpha^2 \|\theta\|^2 / 2}$ for all $\theta \in \mathbb{R}^d$.
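As a toy illustration of this risk functional (our own example, not drawn from the cited papers), the following Python sketch Monte Carlo-estimates $L(\hat s; \rho^*)$ when $\rho^*$ is the standard normal, whose score is $s^*(x) = -x$, against a deliberately biased estimator:

```python
import random

def true_score(x):
    # Score of the standard normal: (log rho*)'(x) = -x.
    return -x

def shrunk_score(x, c=0.9):
    # A hypothetical, deliberately biased estimator: s_hat(x) = -c x.
    return -c * x

def mc_risk(s_hat, n_mc=100_000, seed=1):
    # Monte Carlo estimate of L(s_hat; rho*) = E_{X ~ rho*} |s_hat(X) - s*(X)|^2.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_mc):
        x = rng.gauss(0.0, 1.0)
        total += (s_hat(x) - true_score(x)) ** 2
    return total / n_mc

# For s_hat(x) = -0.9 x the exact risk is (0.1)^2 * E[X^2] = 0.01,
# so the Monte Carlo estimate should land close to 0.01.
```

The weighting by $\rho^*$ is what distinguishes this loss from a plain $L^2(\mathbb{R}^d)$ norm: errors in low-density regions are discounted, which is also why estimators need regularization there.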

2. Minimax Rate Characterization and Lower Bounds

The minimax rate in the standard setting is sharply characterized as follows:

  • For $L$-Lipschitz (i.e., $\beta = 1$) scores:

$$R_n(F_{\alpha, L}) = \Theta(n^{-2/(d+4)}),$$

up to polylogarithmic factors in $n$ (Wibisono et al., 2024).

  • For $\beta$-Hölder scores ($0 < \beta \leq 1$):

$$R_n(F_{\alpha, \beta, L}) = \Theta(n^{-2\beta/(d+2\beta+2)})$$

These rates are established using Fano-type arguments. For Lipschitz classes, the proof constructs a collection of perturbed Gaussian densities; the pairwise separation of their scores in squared $L^2(\rho)$-distance scales as $\epsilon^2$, while their pairwise Kullback-Leibler divergences scale as $\epsilon^4$. Fano's inequality then forces risk $\gtrsim \epsilon^2$ whenever $n\epsilon^4 \lesssim \log M$, where $M$ is the packing cardinality; calibrating $\epsilon$ against this constraint yields the minimax risk scaling $n^{-2/(d+4)}$ (Wibisono et al., 2024).
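One arithmetic calibration consistent with these scalings is sketched below; the packing size $\log M \asymp \epsilon^{-d}$ is an assumption for illustration (roughly one perturbation per cell of side $\asymp \epsilon$), with the exact count supplied by the construction in the cited paper:

```latex
% Fano calibration sketch: pairwise score separation \epsilon^2,
% pairwise KL divergence \epsilon^4, assumed log-packing \epsilon^{-d}.
\[
  n\,\epsilon^4 \;\asymp\; \log M \;\asymp\; \epsilon^{-d}
  \quad\Longrightarrow\quad
  \epsilon^{d+4} \;\asymp\; n^{-1}
  \quad\Longrightarrow\quad
  \text{risk} \;\gtrsim\; \epsilon^2 \;\asymp\; n^{-2/(d+4)}.
\]
```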

The corresponding upper bounds are achieved by regularized kernel estimators and, in the context of generative modeling, empirical risk minimization over suitable function classes (e.g., neural networks) using denoising score matching objectives (Stéphanovitch et al., 7 Jul 2025).

3. Achievability: Estimators Attaining the Minimax Rate

A prototypical estimator is the regularized score plug-in derived from a Gaussian-kernel smoothed empirical measure $\mu_n$:

$$\rho_{n,h}(x) = (\mu_n * N(0, h^2 I_d))(x),$$

with estimated score

$$\hat s_h(x) = \nabla \log \rho_{n,h}(x),$$

computed with a suitable denominator regularization for numerical stability, e.g.,

$$\hat s_h(x) = \frac{ \int \frac{y - x}{h^2}\, G_h(x - y)\, d\mu_n(y) }{ \max\{ \rho_{n,h}(x), \varepsilon \} },$$

where $G_h$ is the Gaussian density with variance $h^2$, so that the numerator equals $\nabla \rho_{n,h}(x)$ (note $\nabla_x G_h(x-y) = \frac{y-x}{h^2} G_h(x-y)$).
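A minimal one-dimensional sketch of this regularized plug-in estimator (illustrative; the helper names and the sanity-check setup are our own). The numerator is the derivative of the smoothed density, using $\frac{d}{dx} G_h(x-y) = \frac{y-x}{h^2} G_h(x-y)$:

```python
import math
import random

def score_estimate(x, samples, h, eps=1e-12):
    """Regularized Gaussian-kernel plug-in score estimator in 1-D (sketch).

    Numerator: (1/n) sum_i (y_i - x)/h^2 * G_h(x - y_i)  = rho_{n,h}'(x);
    denominator: max(rho_{n,h}(x), eps) for numerical stability.
    """
    n = len(samples)
    c = 1.0 / (math.sqrt(2.0 * math.pi) * h)  # Gaussian kernel normalizer
    num, den = 0.0, 0.0
    for y in samples:
        g = c * math.exp(-((x - y) ** 2) / (2.0 * h * h))
        num += (y - x) / (h * h) * g          # derivative of the kernel
        den += g                              # smoothed density
    return (num / n) / max(den / n, eps)

# Sanity check on N(0,1) data: at fixed bandwidth h the estimator targets
# the score of the smoothed density N(0, 1 + h^2), i.e. x -> -x / (1 + h^2).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(50_000)]
h = 0.3
x0 = 0.5
est = score_estimate(x0, data, h)
target = -x0 / (1.0 + h * h)
```

At fixed $h$ the estimator is consistent for the smoothed score; the bias relative to $s^*$ is then controlled by shrinking $h$ with $n$, as in the bandwidth optimization below.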

The estimation error decomposes into three components:

  1. Variance: $\asymp [n h^{d+2}]^{-1}$ from kernel smoothing and Hellinger convergence.
  2. Bias: $\asymp L^2 h^2 d$ from kernel-induced smoothing.
  3. Regularization: a multiplicative $O(\log n)$ factor owing to denominator truncation.

Optimizing over the bandwidth $h$ by balancing variance and bias gives $h \asymp n^{-1/(d+4)}$ and an overall risk $\asymp n^{-2/(d+4)}$ in the Lipschitz case (Wibisono et al., 2024).
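Written out, with $h$ denoting the kernel bandwidth (kernel variance $h^2$) so that the squared bias scales as $h^2$ and the variance as $(n h^{d+2})^{-1}$ — a standard derivative-estimation accounting, stated here as an assumption, with $L$ and $d$ absorbed into constants — the balance reads:

```latex
\[
  \underbrace{L^2 h^2}_{\text{squared bias}}
  \;\asymp\;
  \underbrace{\frac{1}{n\,h^{d+2}}}_{\text{variance}}
  \quad\Longrightarrow\quad
  h^{d+4} \asymp n^{-1}
  \quad\Longrightarrow\quad
  h \asymp n^{-1/(d+4)},
  \qquad
  \text{risk} \asymp h^2 \asymp n^{-2/(d+4)}.
\]
```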

Alternative approaches using neural network score estimators trained via denoising score matching also match the minimax rate (up to log factors), as established via approximation-theoretic arguments for Hölder classes (Stéphanovitch et al., 7 Jul 2025).

4. Extensions: Effect of Smoothness, Tails, and Shape Constraints

Smoothness: For $\beta$-Hölder smooth ($\beta \leq 1$) scores, the minimax rate transitions to $n^{-2\beta/(d+2\beta+2)}$. If the underlying density is smoother or the score is more regular, faster rates can be achieved, but the above remains sharp when restricting to $\beta \leq 1$ (Wibisono et al., 2024).

Tail Behavior: Heavy-tailed target distributions alter minimax rates. For exponentially decaying tails, the minimax rate and the curse of dimensionality persist as in the subgaussian/light-tailed case. For polynomial tails of order $\gamma$, the minimax risk degrades to $n^{-(\gamma+1)/(d+\gamma+1)}\, t^{-(1+d/2)(\gamma+1)/(d+\gamma+1)}$ at smoothing scale $t$ (Yu et al., 10 Jan 2026). Efficient estimators require adapting both the truncation threshold and the bandwidth to balance tail and bulk errors.
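The explicit $\gamma$ dependence can be read off the $n$-exponent of the stated rate; a small sketch (the helper name is our own):

```python
def poly_tail_exponent(d, gamma):
    # n-exponent of the polynomial-tail minimax rate
    # n^{-(gamma+1)/(d+gamma+1)} at a fixed smoothing scale t.
    return (gamma + 1.0) / (d + gamma + 1.0)

# Heavier tails (smaller gamma) give a smaller exponent, i.e. a slower rate;
# as gamma grows the exponent approaches 1, matching the n^{-1} factor
# of the exponential-tail rate n^{-1} t^{-1-d/2}.
```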

Shape Constraints: For log-concave densities with additional Hölder smoothness of the log-density ($\beta \in [1,2]$), the minimax rate for score estimation is $L^{2/(2\beta+1)} n^{-\beta/(2\beta+1)}$ up to polylogarithmic factors; this is faster than density-based or smoothness-only rates in the regime $\beta < 2$ (Lewis et al., 16 Dec 2025). Controlling the score's tail growth, e.g. via quantile-based bounds, is crucial for separating this regime from the non-estimable one, in which the risk remains bounded away from zero.

5. Implications for Score-based Generative Modeling

The statistical bottleneck in diffusion and SGM algorithms is often the estimation of the time-dependent score function $s^*(t, x)$ along the forward SDE. Precise error propagation typically shows that the Wasserstein or total variation distance between the generative model and the target distribution is upper bounded by an integral of the score estimation error. Consequently, the minimax sample complexity for achieving accuracy $\delta$ is exponential in $d$ for subgaussian or exponentially-tailed targets:

$$n \asymp \delta^{-(d+4)/2}$$

for the canonical $L$-Lipschitz class (Wibisono et al., 2024, Stéphanovitch et al., 7 Jul 2025).
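Inverting the Lipschitz-class risk $n^{-2/(d+4)} = \delta$ makes this dimension dependence concrete (a toy calculation, ignoring constants and log factors; the helper name is our own):

```python
def samples_needed(delta, d):
    # Solve n^{-2/(d+4)} = delta for n:  n = delta^{-(d+4)/2}.
    return delta ** (-(d + 4) / 2.0)

# At target squared-L2(rho) accuracy delta = 0.1, the required sample
# size grows exponentially with the dimension d:
#   d = 1  ->  ~3.2e2
#   d = 6  ->  ~1.0e5
#   d = 16 ->  ~1.0e10
```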

Recent bounds for models trained with denoising score matching (via neural networks) and sampled using SDE or ODE integrators demonstrate that generative models similarly obey the minimax $n^{-(\beta+1)/(2\beta+d)}$ rate in Wasserstein or TV loss (up to log factors) (Stéphanovitch et al., 7 Jul 2025, Zhang et al., 2024). Nearly matching lower bounds are obtained for Sobolev/Hölder densities. For scores of smooth, compactly supported densities, recent analysis shows this rate can be achieved without early stopping or extraneous log terms (Dou et al., 2024).

A summary table of minimax rates for selected settings:

| Target/Score Class | Minimax Score Error Rate | Notes |
|---|---|---|
| Subgaussian, Lipschitz score | $n^{-2/(d+4)}$ | Curse of dimensionality (Wibisono et al., 2024) |
| Subgaussian, $\beta$-Hölder score | $n^{-2\beta/(d+2\beta+2)}$ | $\beta \leq 1$ (Wibisono et al., 2024) |
| Heavy-tailed, exponential decay | $n^{-1} t^{-1-d/2}$ | $t \gtrsim n^{-2/(d+2\beta)}$ (Yu et al., 10 Jan 2026) |
| Heavy-tailed, polynomial decay | $n^{-(\gamma+1)/(d+\gamma+1)}$ (at scale $t$) | Explicit $\gamma$ dependence (Yu et al., 10 Jan 2026) |
| Log-concave, Hölder log-density | $L^{2/(2\beta+1)} n^{-\beta/(2\beta+1)}$ | $\beta \in [1,2]$ (Lewis et al., 16 Dec 2025) |

6. Technical Methods and Lower Bound Techniques

The proofs of both lower and upper bounds consistently exploit:

  • Construction of an explicit, finite class of densities (packings), often as local perturbations of a Gaussian or flat base measure, which maintain the required smoothness and tail conditions.
  • Fano's inequality to relate estimation risk to packing separation and Kullback-Leibler divergence, leveraging a separation in the $L^2(\rho)$ norm of the score function for risk lower bounds (Wibisono et al., 2024, Yu et al., 10 Jan 2026, Dou et al., 2024).
  • For upper bounds, analyses are based on kernel-based plug-in estimators, bias-variance decompositions, regularization schemes for low-density regions, and in neural network settings, covering number bounds and function approximation rates (Stéphanovitch et al., 7 Jul 2025, Lewis et al., 16 Dec 2025).

Specialized tools are applied for log-concave or shape-constrained settings, such as uniform confidence bands for the score built via local kernel smoothing, and inversion to produce adaptive, multiscale estimators (Lewis et al., 16 Dec 2025).

7. Broader Context and Current Open Questions

Minimax results for score estimation sharply extend classical nonparametric theory of density and derivative estimation into a setting tailored to the loss structure of score matching and SGM applications. In the heavy-tailed case, a qualitative dichotomy emerges: exponential tails allow the optimal light-tailed rates up to logarithmic factors, while polynomial tails result in explicit degradation indexed by the tail exponent (Yu et al., 10 Jan 2026). It remains open whether the derived sampling rate under polynomial tails is minimax optimal, particularly for the continuous reverse SDE setting.

Future directions include statistical limits for other forward processes (e.g., non-Gaussian diffusions or Lévy processes), minimax theory for discrete-time SGM algorithms, adaptive procedures for unknown smoothness or tail parameters, and the interaction of score estimation with model misspecification or domain constraints. The understanding of topology dependence in parametric score models for pairwise comparisons, as studied with respect to the Laplacian spectrum in the Bradley–Terry–Luce and Thurstone models, further reveals interaction between structural graph constraints and minimax error rates (Shah et al., 2015).

These developments clarify the fundamental sample efficiency limits for a key statistical primitive underlying contemporary generative modeling methodologies, and provide design principles for estimator construction and experimental design in high-dimensional nonparametric inference.
