Weighted Empirical Risk Minimization
- Weighted Empirical Risk Minimization is a framework that applies nonnegative, data-dependent weights to each sample to tackle biases like covariate shift and temporal drift.
- It incorporates various weighting schemes—including importance weighting, stratification, and adaptive weighting—to optimize performance under diverse distributional challenges.
- The approach provides strong theoretical guarantees and has been empirically validated in applications such as survival analysis, policy evaluation, and federated learning.
Weighted Empirical Risk Minimization (Weighted ERM) generalizes the principle of empirical risk minimization by assigning nonnegative, typically data-dependent, weights to individual samples or groups of samples in the objective function. Weighted ERM frameworks provide a unified, tractable way to address diverse challenges in statistical learning, including covariate or label shift, temporal drift, adaptive collection, stratification, censoring, heteroscedasticity, data-dependence, privacy constraints, and fairness or robustness criteria. The mathematical structure, theoretical guarantees, and implementation details of weighted ERM are intimately tied to the source and distribution of the weights, which may be deterministic, estimated, or optimized within a constrained family.
1. Formal Definition and Foundational Structure
Weighted ERM seeks parameters $\theta$ of a predictor $h_\theta$ (or a hypothesis $h \in \mathcal{H}$) by minimizing an objective of the form

$$\hat{R}_\omega(h) = \sum_{i=1}^{n} \omega_i \, \ell(h(x_i), y_i),$$

where $(x_1, y_1), \ldots, (x_n, y_n)$ are training samples, $\ell$ is a pointwise loss function, and the weights $\omega_i$ are nonnegative and often normalized (i.e., $\sum_{i=1}^n \omega_i = 1$) (Jeong et al., 17 Jul 2025, Vogel et al., 2020). This contrasts with standard ERM, which takes $\omega_i = 1/n$.
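For squared loss, the weighted objective admits a closed-form minimizer; the following is a minimal illustrative sketch on toy data (not drawn from any of the cited works):

```python
import numpy as np

def weighted_erm_least_squares(X, y, w):
    # Minimize sum_i w_i * (x_i^T theta - y_i)^2 over theta: the weighted
    # ERM objective under squared loss, with closed form
    # theta = (X^T W X)^{-1} X^T W y.
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)

w_uniform = np.full(200, 1 / 200)   # w_i = 1/n recovers standard ERM
theta_hat = weighted_erm_least_squares(X, y, w_uniform)
```

With uniform weights this coincides with ordinary least squares; nonuniform weights pull the fit toward the upweighted samples.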
Alternative formulations include:
- Population risk: $R_w(h) = \mathbb{E}\left[ w(X, Y)\, \ell(h(X), Y) \right]$ for a deterministic or estimated weight function $w$ (Zhang et al., 4 Jan 2025).
- Partitioned or distributed weights: ERM over multiple data sets or sources, $\hat{R}(h) = \sum_{k=1}^{K} \frac{n_k}{n} \hat{R}_k(h)$, with weights matching the data size $n_k$ of each source (Kang et al., 2019).
- Reweighted ERM by optimization: Weights are themselves optimized to minimize upper bounds on risk, conditional risk, or generalization error (Wang et al., 2017, Jeong et al., 17 Jul 2025).
Weighted ERM encompasses importance sampling (likelihood ratio weighting), output-dependent penalty weighting, closed-loop sample upweighting, and reweighting based on meta-optimization over distributional families.
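As a sanity check on the partitioned formulation, combining per-source empirical risks with data-size weights $n_k/n$ reproduces the pooled empirical risk exactly (a toy numerical illustration, not from the cited papers):

```python
import numpy as np

# Toy per-sample losses from three sources of different sizes.
rng = np.random.default_rng(2)
losses_by_source = [rng.uniform(size=n_k) for n_k in (30, 70, 100)]
n = sum(len(l) for l in losses_by_source)

# Size-weighted combination of per-source empirical risks.
weighted_risk = sum((len(l) / n) * l.mean() for l in losses_by_source)

# Pooling all samples and averaging gives the same number.
pooled_risk = np.concatenate(losses_by_source).mean()
```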
2. Statistical Motivation and Theoretical Guarantees
Weighted ERM arises naturally in correction for sample selection bias, distribution shift, and in deriving oracle-efficient estimators on selected or non-i.i.d. regions (Vogel et al., 2020, Brock et al., 5 Feb 2026).
- Importance weighting for covariate/label shift: When learning from biased or non-representative samples drawn from a distribution $Q$ but targeting risk under a distribution $P$, weighting samples by the Radon–Nikodym derivative $w = \frac{dP}{dQ}$ yields an unbiased empirical risk estimator (Vogel et al., 2020). This construction guarantees that, under boundedness and mild estimation error of $w$, excess risk bounds inflate only multiplicatively with $\|w\|_\infty$.
- Distributional drift and temporal adaptation: In nonstationary regimes, weighted ERM enables optimal tracking, with theoretical decompositions distinguishing learning error (estimation/statistical) and drift error (misspecification across evolving distributions) (Brock et al., 5 Feb 2026). The out-of-sample excess risk decomposes as a sum of a learning term, which shrinks with the effective sample size induced by the weights, and a drift term, which grows with the accumulated distributional change.
- Conditional risk improvement: Weighted ERM can tighten conditional risk bounds over particular sub-populations. Under a balanceable Bernstein condition (variance of the weighted excess loss controlled by its mean), plugging in a well-estimated weight function (a margin or precision function) yields strictly improved rates in high-margin or low-variance regions versus the worst-case (global) rates attainable by unweighted ERM (Zhang et al., 4 Jan 2025).
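The importance-weighting construction in the first bullet can be checked numerically when both densities are known; here source $Q$ and target $P$ are Gaussians chosen purely for illustration (in practice $dP/dQ$ must be estimated):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x_src = rng.normal(0.0, 1.0, size=50_000)    # samples from source Q = N(0, 1)

# Radon-Nikodym weights w(x) = dP/dQ for target P = N(1, 1).
w = gauss_pdf(x_src, 1.0, 1.0) / gauss_pdf(x_src, 0.0, 1.0)

# Squared loss of the fixed predictor "always output 1.0".
loss = (x_src - 1.0) ** 2

# The weighted empirical risk is an unbiased estimate of the target risk,
# which here equals Var_P(X) = 1.0.
weighted_risk = np.mean(w * loss)
naive_risk = np.mean(loss)    # biased: estimates the Q-risk, which is 2.0
```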
Weighted ERM's excess risk guarantees, under appropriate mixing, margin, and complexity conditions, attain minimax-optimal rates for stationary problems and achieve optimal (or nearly optimal) adaptation in nonstationary or biased settings (Brock et al., 5 Feb 2026, Jeong et al., 17 Jul 2025, Wang et al., 2017).
3. Weighting Schemes: Origins and Algorithmic Realizations
The precise choice or estimation of weights underlies the strength and regime of weighted ERM. Notable constructions include:
| Weight Type | Source and Methodology | Representative Application |
|---|---|---|
| Importance weights | Density ratio $w = dP/dQ$, plug-in estimation | Covariate/label shift, selection bias (Vogel et al., 2020, Bibaut et al., 2021) |
| Temporal/adaptive | Solution of a convex program (e.g., RIDER) | Temporal drift, shifting regimes (Jeong et al., 17 Jul 2025, Brock et al., 5 Feb 2026) |
| Error-driven upweight | Closed-loop performance metric (CW-ERM) | Autonomous driving, failure targeting (Kumar et al., 2022) |
| Stratification | Target-to-sample stratum frequency ratios | Stratified population correction, balancing (Vogel et al., 2020, Myttenaere et al., 2015) |
| Censoring/survival | Inverse-probability-of-censoring (Kaplan–Meier) weights | Right-censored regression/survival (Ausset et al., 2019) |
| Client/partition size | Weights proportional to local data size $n_k/n$ | Distributed/federated ERM (Kang et al., 2019) |
| Data-dependent balance | Estimated margin/precision function | Large-margin regions, heteroscedasticity (Zhang et al., 4 Jan 2025) |
| $f$-divergence regularization | Dual solution of divergence-regularized ERM | Robustness, distributional constraints (Daunas et al., 19 Jan 2026) |
Practical estimation of weights must control variance, avoid division by near-zero quantities, and ensure stability when rare strata or high-variance regions are present (Vogel et al., 2020, Myttenaere et al., 2015, Daunas et al., 19 Jan 2026). For instance, in censoring-adjusted ERM, leave-one-out conditional Kaplan–Meier weights are used to avoid self-weighting bias (Ausset et al., 2019).
Closed-loop schemes such as CW-ERM employ a train-identify-upweight-retrain pipeline—first identifying high-failure or rare-violation samples in a closed-loop simulation, then upweighting or upsampling them for subsequent training (Kumar et al., 2022).
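The CW-ERM pipeline can be sketched schematically; here a weighted least-squares fit stands in for policy training, and a per-sample squared error stands in for the closed-loop simulation metric (both are illustrative substitutions, not the authors' implementation):

```python
import numpy as np

def fit_wls(X, y, w):
    # Weighted least-squares fit: stands in for the policy-training step.
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=300)

w = np.ones(300)
theta = fit_wls(X, y, w)                      # 1) train

# 2) identify: per-sample squared error stands in for the closed-loop
#    simulation metric that flags failures in CW-ERM.
failure_score = (X @ theta - y) ** 2
hard = failure_score > np.quantile(failure_score, 0.9)

w[hard] *= 5.0                                # 3) upweight the failing ~10%
theta_cw = fit_wls(X, y, w)                   # 4) retrain with new weights
```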
Adaptive weighting for temporal data (e.g., in the RIDER methodology (Jeong et al., 17 Jul 2025)) solves a quadratic program balancing sampling variance against distributional drift using covariance and ARMA structure of the underlying process.
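A stylized version of this variance-drift trade-off (deliberately simplified: a diagonal quadratic objective in place of the full covariance/ARMA structure that RIDER uses) already shows the qualitative behavior, with recent low-drift blocks receiving the largest weights:

```python
import numpy as np

# Each block t contributes sigma2 * w_t^2 from sampling variance and
# lam * drift_t^2 * w_t^2 from distributional drift. Minimizing w^T A w
# subject to sum(w) = 1 has the closed form w = A^{-1} 1 / (1^T A^{-1} 1).
sigma2, lam = 1.0, 4.0
drift = np.array([3.0, 2.0, 1.0, 0.0])   # older blocks have drifted more
A = np.diag(sigma2 + lam * drift**2)
w = np.linalg.solve(A, np.ones_like(drift))
w /= w.sum()
```

The most recent block (zero drift) gets the largest weight while heavily drifted blocks are discounted but not discarded, interpolating between "use only the last block" and "pool everything".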
4. Optimization and Generalization Strategies
The optimization in weighted ERM is tractable for convex losses and regularizers, with standard solvers (stochastic gradient descent, QP solvers, etc.) supporting sample-wise weighting. For complex structured weights, particularly in networked, partitioned, or constrained scenarios, specific approximation algorithms (e.g., an FPTAS for networked ERM (Wang et al., 2017)) are available, and duality arguments underpin weights derived from $f$-divergence regularization (Daunas et al., 19 Jan 2026).
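For the KL divergence, one member of the $f$-divergence family, the dual solution has a well-known closed form: the weights tilt exponentially in the per-sample loss. A minimal sketch (the temperature `tau` is illustrative notation, not taken from the cited paper):

```python
import numpy as np

def kl_tilted_weights(losses, tau):
    # Dual of KL-regularized reweighting: w_i proportional to exp(loss_i / tau).
    # Small tau concentrates weight on high-loss samples; large tau is
    # nearly uniform.
    z = (losses - losses.max()) / tau    # subtract the max for stability
    w = np.exp(z)
    return w / w.sum()

losses = np.array([0.1, 0.5, 2.0])
w_sharp = kl_tilted_weights(losses, tau=0.1)    # concentrates on worst loss
w_flat = kl_tilted_weights(losses, tau=100.0)   # close to uniform
```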
Generalization analysis leverages empirical process theory, Rademacher complexity, and chaining with respect to weighted function classes (Vogel et al., 2020, Bibaut et al., 2021). In adaptive or bandit data, new maximal inequalities exploiting importance sampling structure yield tight excess risk and regret bounds with explicit dependence on exploration (weight) profiles (Bibaut et al., 2021).
Selection of weights is often itself a regularization/meta-optimization procedure, e.g., choosing decay rates for exponential smoothing in nonstationary settings, optimizing quadratic programs for temporal drift (RIDER), or tuning weighting kernel parameters for stratification or censoring.
5. Canonical Applications and Empirical Findings
Weighted ERM has enabled significant advances across a spectrum of domains:
- Sample bias and covariate/label shift: Weighted ERM with likelihood ratio correction demonstrates empirical improvement and consistency restoration in synthetic, stratified, and high-dimensional image datasets; bounds scale essentially as in the unbiased i.i.d. case provided the weighting function is accurately estimated and bounded (Vogel et al., 2020).
- Survival/censored learning: IPCW weighting with Kaplan–Meier or Cox models restores correct risk estimation and achieves learning rates matching those in uncensored settings, given appropriate smoothness and boundedness conditions (Ausset et al., 2019).
- Policy learning and off-policy evaluation: Importance-weighted ERM achieves minimax-optimal regret rates under decaying exploration in bandit-collected data, outperforming ad hoc stabilization heuristics for linear outcome models (Bibaut et al., 2021).
- Temporal and nonstationary prediction: Weighted schemes (RIDER) empirically outperform pooling, last-block, and exponential baseline schemes, confirming the theoretical synthesis of these heuristics as special cases of an optimal trade-off (Jeong et al., 17 Jul 2025).
- Closed-loop safety in imitation learning: CW-ERM, via selective upweighting of high-failure scenes, achieves substantial (30–40%) collision rate reductions in autonomous driving simulation, illustrating efficacy for real-world non-differentiable metrics (Kumar et al., 2022).
- Distributed/federated learning: Properly normalized data-size weights improve sensitivity and excess risk in differentially private distributed ERM, with empirical accuracy matching centralized baselines as data imbalance grows (Kang et al., 2019).
- Robustness and divergence control: $f$-divergence regularization in ERM fits naturally into the weighted ERM paradigm, allowing explicit control over solution concentration and robustness properties via the form and strength of the divergence (Daunas et al., 19 Jan 2026).
- Conditional improvement: Plug-in weighted ERM strictly improves bounds on large-margin or low-variance subregions, demonstrated both theoretically and via synthetic experiments (Zhang et al., 4 Jan 2025).
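The IPCW construction behind the survival results can be sketched with an unconditional Kaplan–Meier estimate of the censoring survival (a simplified stand-in for the leave-one-out conditional weights discussed above; no tie handling):

```python
import numpy as np

def ipcw_weights(times, delta):
    # Inverse-probability-of-censoring weights: an uncensored sample i
    # (delta_i = 1) gets weight 1 / G_hat(t_i^-), where G_hat is the
    # Kaplan-Meier estimate of the censoring survival P(C > t); censored
    # samples (delta_i = 0) get weight 0.
    order = np.argsort(times)
    G_before = np.ones(len(times))       # G_hat evaluated just before t_i
    surv = 1.0
    for rank, i in enumerate(order):
        G_before[i] = surv
        if delta[i] == 0:                # censorings are the "events" for G_hat
            surv *= 1.0 - 1.0 / (len(times) - rank)
    return np.where(delta == 1, 1.0 / G_before, 0.0)
```

With no censoring every weight is 1 and the weighted risk reduces to the ordinary empirical risk; under censoring, uncensored observations occurring after censoring times are upweighted to compensate for the removed mass.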
6. Limitations, Open Problems, and Extension Pathways
Weighted ERM's performance and guarantees depend critically on the quality and structure of the weights:
- In settings with rare strata or vanishing denominators, weights may explode, causing variance inflation and unstable generalization (Vogel et al., 2020, Myttenaere et al., 2015).
- Weight estimation from data can introduce additional estimation error, though linearization analyses show no asymptotic inefficiency arises in typical finite-dimensional regimes (Vogel et al., 2020).
- For highly non-i.i.d. or dependent data (networked observations, time series), estimation of appropriate weights and generalization bounds requires more intricate dependence-aware analysis (Wang et al., 2017, Brock et al., 5 Feb 2026).
- Sample splitting for weight estimation is recommended for oracle-style conditional guarantees but may require careful balancing to avoid bias or inefficiency (Zhang et al., 4 Jan 2025).
- Tuning for nonstationarity involves balancing drift-tracking with estimation variance, typically via cross-validation or online error monitoring (Brock et al., 5 Feb 2026, Jeong et al., 17 Jul 2025).
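Two standard stabilizers for the exploding-weight problem in the first bullet, quantile clipping and self-normalization, can be combined in a few lines (generic practice, not a recipe from the cited papers):

```python
import numpy as np

def stabilized_weights(raw_w, clip_quantile=0.99):
    # 1) Clip at an upper quantile to bound the largest weight, then
    # 2) self-normalize so the weights sum to 1 (Hajek-style estimator).
    # Both trade a small bias for a potentially large variance reduction.
    cap = np.quantile(raw_w, clip_quantile)
    w = np.minimum(raw_w, cap)
    return w / w.sum()

rng = np.random.default_rng(4)
raw = np.exp(rng.normal(0.0, 3.0, size=10_000))   # heavy-tailed raw ratios
w = stabilized_weights(raw)
```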
Open directions include:
- Extension to online or adaptively updated weighted ERM frameworks that dynamically adjust to observed drift or distributional change (Jeong et al., 17 Jul 2025, Brock et al., 5 Feb 2026).
- Incorporation of more general or structured $f$-divergence penalties for robust learning under adversarial or worst-case uncertainty (Daunas et al., 19 Jan 2026).
- Extensions to high-dimensional, few-shot, or rare-event regimes via hierarchical, Bayesian, or meta-learning approaches to weight regularization.
- Characterization of optimal weighting for weakly dependent or non-mixing time series, structured prediction, or higher-order interactions in networked data (Wang et al., 2017).
Weighted ERM remains central in modern statistical learning, serving as a theoretically grounded and empirically validated framework for principled adaptation to structural biases, dependence, drift, and resource constraints in increasingly complex data modalities.