
Empirical Likelihood Estimation under CMR

Updated 12 January 2026
  • The paper introduces a semiparametric framework that casts estimation as an infinite-dimensional empirical likelihood problem to achieve the efficiency bound.
  • It employs RKHS-, neural-network-, and sieve-based approximations to rigorously enforce conditional moment restrictions.
  • Empirical studies demonstrate that these methods outperform traditional estimators by significantly reducing mean squared error.

Empirical-likelihood (EL) estimators under conditional moment restrictions (CMR) form a foundational framework for inference in semiparametric econometrics, statistical machine learning, and causal inference. The central insight is that, in models identified by conditional moments, estimation can be cast as an infinite-dimensional empirical-likelihood problem, leading to procedures that achieve efficiency bounds, enjoy robust small-sample properties, and take advantage of function-approximation frameworks such as reproducing kernel Hilbert spaces (RKHS) and neural networks. Below, the mathematical setup, principal methodologies, asymptotic properties, and implementation details are developed, referencing central developments (Kremer et al., 2022, Chaumaray et al., 2020, Kremer et al., 2023, Chib et al., 2021).

1. Statistical Formulation and Conditional Moment Restrictions

Suppose $\{(X_i,Z_i)\}_{i=1}^n$ are i.i.d. draws from an unknown law $P_{X,Z}$, $\theta \in \Theta \subseteq \mathbb{R}^p$ is the finite-dimensional parameter of interest, and $\psi: \mathcal{X} \times \Theta \to \mathbb{R}^m$ is a prescribed moment function. The conditional moment restriction stipulates

$$\mathbb{E}[\psi(X;\theta_0) \mid Z] = 0 \quad P_Z\text{-a.s.}$$

for a unique $\theta_0 \in \Theta$. This model class generalizes classical mean regression and instrumental-variable settings, incorporating nonparametric or semiparametric nuisance components as needed (Kremer et al., 2022, Chib et al., 2021).

A key equivalence, via the law of iterated expectations, is

$$\mathbb{E}\left[w(Z)^{\top}\psi(X;\theta_0)\right]=0 \quad \forall\, w:\mathcal{Z}\to\mathbb{R}^m,$$

yielding an infinite system of unconditional moment restrictions, indexed by test functions $w$. The solution set can be represented abstractly as the vanishing of a functional on a Hilbert space $\mathcal{H}$:

$$E_{P_0}\left[\Psi(X,Z;\theta_0)\right] = 0, \quad \text{with } \Psi(X,Z;\theta)[h] = \psi(X;\theta)^{\top} h(Z) \;\;\forall h \in \mathcal{H}$$

(Kremer et al., 2022, Kremer et al., 2023).
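The equivalence can be checked by simulation. Below is a small numerical illustration (the heteroskedastic data-generating process and the dictionary of test functions are our own choices, not taken from the cited papers): every induced unconditional moment is near zero at $\theta_0$ and bounded away from zero at a wrong parameter value.

```python
# Monte Carlo check of E[psi(X; theta0) | Z] = 0  <=>  E[w(Z) psi(X; theta0)] = 0.
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 200_000, 1.5

Z = rng.uniform(0.0, 1.0, n)
Y = theta0 * Z + (0.5 + Z) * rng.standard_normal(n)  # heteroskedastic noise, mean 0 given Z

def psi(theta):
    """Moment function psi(X; theta) = Y - theta * Z for the regression model."""
    return Y - theta * Z

# A few test functions w: each induces one unconditional moment E[w(Z) psi].
test_functions = {"1": np.ones(n), "z": Z, "z^2": Z**2, "sin(z)": np.sin(Z)}

for name, w in test_functions.items():
    at_true = np.mean(w * psi(theta0))   # ~ 0 at theta0
    at_wrong = np.mean(w * psi(0.0))     # biased away from 0
    print(f"w={name:7s}  E_n[w psi(theta0)]={at_true:+.4f}  E_n[w psi(0)]={at_wrong:+.4f}")
```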

2. Functional Generalized Empirical Likelihood Framework

Generalized empirical likelihood (GEL) seeks an alternative probability measure $P \ll \widehat{P}_n = n^{-1} \sum_{i=1}^n \delta_{(X_i,Z_i)}$ that (i) strictly enforces the continuum of moment constraints and (ii) incurs minimal divergence from the empirical measure. For a convex function $\varphi$ generating the divergence $D_\varphi$, one solves

$$R(\theta) = \inf_{P\ll \widehat{P}_n} \Big\{ D_\varphi(P \,\Vert\, \widehat{P}_n) \;\Big|\; E_P[\Psi(X,Z;\theta)] = 0 \Big\}.$$

For the original empirical likelihood, $\varphi(p) = -2\log p$, the primal problem is

$$\min_{\{p_i\}:\,\sum_i p_i=1,\; p_i\geq 0} \;\sum_{i=1}^n -2\log(n p_i) \quad \text{s.t.} \quad \sum_{i=1}^n p_i\,\psi(X_i;\theta)^{\top} h(Z_i) = 0 \;\;\forall h\in\mathcal{H},$$

where the constraints span an infinite-dimensional space (Kremer et al., 2022).

The dual emerges by introducing a Lagrange-multiplier function $\lambda:\mathcal{Z} \to \mathbb{R}^m$:

$$R(\theta) = \sup_{\lambda(\cdot)} \left\{ \frac{1}{n}\sum_{i=1}^n \log\left(1+\lambda(Z_i)^{\top}\psi(X_i;\theta)\right) \right\},$$

possibly with an RKHS-norm or $L_2$ regularizer on $\lambda$. For general $\varphi$, the dual takes the form

$$\sup_{h\in\mathcal{H}} \left\{ -\frac{1}{n} \sum_{i=1}^n \varphi^*\big(\Psi_i(\theta)[h]\big) - \lambda_n \|h\|_{\mathcal{H}} \right\},$$

where $\varphi^*$ is the convex conjugate; for EL, $\varphi^*(v) = -\log(1-v)$ (Kremer et al., 2022).
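The finite-dimensional analogue of this dual is easy to compute directly. The sketch below (hypothetical data; the dictionary $\{1, z, z^2\}$ stands in for the full space $\mathcal{H}$) profiles Owen-style empirical likelihood by maximizing the log criterion over the multiplier $\lambda$: the profile is near zero at $\theta_0$ and grows when the constraints are violated.

```python
# EL dual with a finite dictionary of test functions in place of the full space H.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, theta0 = 2_000, 1.5
Z = rng.uniform(0.0, 1.0, n)
Y = theta0 * Z + (0.5 + Z) * rng.standard_normal(n)

def g(theta):
    """Unconditional moments g_i = [h_k(Z_i) * psi_i] for the dictionary {1, z, z^2}."""
    psi = Y - theta * Z
    return np.column_stack([psi, Z * psi, Z**2 * psi])      # shape (n, 3)

def el_profile(theta):
    """R(theta) = sup_lambda (1/n) sum_i log(1 + lambda' g_i(theta))."""
    G = g(theta)
    def neg_dual(lam):
        v = 1.0 + G @ lam
        return np.inf if np.any(v <= 1e-10) else -np.mean(np.log(v))  # guard log domain
    res = minimize(neg_dual, np.zeros(G.shape[1]), method="Nelder-Mead")
    return -res.fun

print(f"R(theta0) = {el_profile(theta0):.4f}")   # close to 0
print(f"R(0)      = {el_profile(0.0):.4f}")      # clearly positive
```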

3. Asymptotic Properties and Efficiency

Under compactness of $\Theta$, continuity of $\psi$, non-singularity of

$$\Omega_0 = E[\Psi \otimes \Psi], \qquad \Sigma_0 = \langle E[\nabla_\theta \Psi],\, E[\nabla_\theta \Psi]\rangle_{\mathcal{H}^*},$$

and uniform Donsker conditions on the class $\Psi(\cdot;\theta)[h]$, one has:

  • Consistency:

$$\hat{\theta} = \arg\min_\theta \sup_{h\in\mathcal{H}} \left\{ -\frac{1}{n}\sum_{i=1}^n \varphi^*\big(\psi(X_i;\theta)^{\top} h(Z_i)\big) - \frac{\lambda_n}{2}\|h\|_{\mathcal{H}}^2 \right\} \xrightarrow{p} \theta_0$$

with $\lambda_n \to 0$ at rate $O(n^{-\xi})$, $\xi < 1/2$.

  • Asymptotic normality:

$$\sqrt{n}\,(\hat{\theta}-\theta_0) \overset{d}{\longrightarrow} N(0,\Sigma_\theta), \qquad \Sigma_\theta = \left(\nabla_\theta\Psi_0\,\Omega_0^{-1}\,\nabla_\theta\Psi_0^{*}\right)^{-1},$$

which coincides with the semiparametric efficiency bound of Chamberlain (1987) (Kremer et al., 2022, Kremer et al., 2023, Chib et al., 2021).

In settings where sieve-based or kernel-based approximations are used, the correct growth rate of the sieve dimension (e.g., $k_n = o(n^{1/6})$ under correct specification) is necessary to guarantee efficiency (Chib et al., 2021).

4. Solution Strategies and Computation

  • RKHS-based implementation:

Let $\mathcal{H}$ be the RKHS of a universal, strictly positive-definite kernel $k$ on $\mathcal{Z}$. By the representer theorem, the maximizer $h^*$ has the form

$$h^*(z) = \sum_{j=1}^n \alpha_j k(Z_j, z),$$

reducing the infinite-dimensional optimization over $h$ to a finite problem in $\alpha \in \mathbb{R}^n$. Algorithmic steps include alternating or simultaneous maximization over $\alpha$ and minimization over $\theta$ (using, e.g., L-BFGS), leveraging Danskin's theorem for gradient computations (Kremer et al., 2022, Kremer et al., 2023).
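As a concrete instance of this reduction, the following sketch (entirely illustrative: data-generating process, kernel bandwidth, and $\lambda_n$ are our own choices) runs the EL variant with a Gaussian kernel, using Owen's quadratic extension of the logarithm to keep the inner problem smooth and globally defined.

```python
# RKHS dual via the representer theorem: h(z) = sum_j alpha_j k(Z_j, z),
# so the inner sup becomes a concave problem in alpha in R^n.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(2)
n, theta0, lam_n = 300, 1.5, 1.0
Z = rng.uniform(0.0, 1.0, n)
Y = theta0 * Z + (0.5 + Z) * rng.standard_normal(n)
K = np.exp(-0.5 * (Z[:, None] - Z[None, :])**2 / 0.25**2)   # Gaussian Gram matrix

def log_star(v, eps=1e-3):
    """log(v) for v >= eps, quadratically extended below (Owen's trick)."""
    return np.where(v >= eps, np.log(np.maximum(v, eps)),
                    np.log(eps) - 1.5 + 2.0 * v / eps - v**2 / (2.0 * eps**2))

def dlog_star(v, eps=1e-3):
    return np.where(v >= eps, 1.0 / np.maximum(v, eps), 2.0 / eps - v / eps**2)

def profile(theta):
    """sup over alpha of (1/n) sum log*(1 + psi_i (K alpha)_i) - (lam_n/2) a'Ka."""
    psi = Y - theta * Z
    def neg_obj(alpha):
        v = 1.0 + psi * (K @ alpha)
        f = -(np.mean(log_star(v)) - 0.5 * lam_n * alpha @ (K @ alpha))
        grad = -(K @ (psi * dlog_star(v)) / n - lam_n * (K @ alpha))
        return f, grad
    res = minimize(neg_obj, np.zeros(n), jac=True, method="L-BFGS-B")
    return -res.fun

# Outer minimization over the scalar theta; Danskin's theorem would supply the
# gradient, but a bounded scalar search suffices in one dimension.
theta_hat = minimize_scalar(profile, bounds=(0.0, 3.0), method="bounded").x
print(f"theta_hat = {theta_hat:.3f}")
```

The inner problem is concave in $\alpha$, so L-BFGS with an analytic gradient is reliable; for a multi-dimensional $\theta$ one would replace the scalar search with gradient steps on the envelope function.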

  • Neural network-based implementation:

Parametrize the dual function $\lambda(z) = h_\omega(z)$ by a feed-forward neural network. The GEL criterion becomes

$$\min_{\theta}\;\max_{\omega}\; -\frac{1}{n}\sum_{i=1}^n \varphi^*\big(\psi(X_i;\theta)^{\top} h_\omega(Z_i)\big) - \frac{\lambda_n}{2n}\sum_{i=1}^n \|h_\omega(Z_i)\|^2$$

Training employs stochastic min-max solvers suited for nonconvex-concave games (e.g., Optimistic Adam) (Kremer et al., 2022).
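A stripped-down stand-in for this training loop is sketched below. To keep it dependency-free and numerically stable, it makes two simplifications (ours, not the papers'): the hidden layer is replaced by frozen random Fourier features, so only the output weights $\omega$ are trained, and a quadratic ($\chi^2$-type) GEL criterion replaces the EL logarithm so every update is a plain gradient step.

```python
# Alternating gradient descent-ascent on the GEL min-max game with a
# random-feature "network" h_omega(z) = omega' phi(z).
import numpy as np

rng = np.random.default_rng(3)
n, D, theta0, lam_n = 1_000, 64, 1.5, 1.0
Z = rng.uniform(0.0, 1.0, n)
Y = theta0 * Z + (0.5 + Z) * rng.standard_normal(n)

# Frozen random Fourier features play the role of the network's hidden layer.
Wf = rng.normal(0.0, 1.0 / 0.25, D)                   # frequencies, bandwidth 0.25
Bf = rng.uniform(0.0, 2.0 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(Z[:, None] * Wf[None, :] + Bf[None, :])  # (n, D)

theta, omega = 0.0, np.zeros(D)
for _ in range(500):                                  # outer descent on theta
    psi = Y - theta * Z
    for _ in range(40):                               # inner ascent on omega
        h = Phi @ omega
        v = psi * h
        # grad of (1/n) sum (v_i - v_i^2/2) - (lam_n/2n) sum h_i^2 w.r.t. omega
        omega += 0.1 * (Phi.T @ ((1.0 - v) * psi - lam_n * h) / n)
    h = Phi @ omega
    theta -= 1.0 * (-np.mean((1.0 - psi * h) * Z * h))  # Danskin-style gradient step
print(f"theta = {theta:.3f}")
```

In practice the plain alternating steps would be replaced by a stochastic min-max solver such as Optimistic Adam over minibatches, with a trained (not frozen) network.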

  • Sieve-based and ETEL approach:

Approximate the CMR via a finite sieve of basis functions $\{\varphi_j(z)\}_{j=1}^{k_n}$, expanding the unconditional moments as $g_i(\theta) = [\varphi_1(Z_i)\psi(X_i;\theta),\ldots,\varphi_{k_n}(Z_i)\psi(X_i;\theta)]'$. Optimization proceeds via Newton or quasi-Newton solvers in the inner loop (for the dual/tilting parameters) and standard optimizers in the outer loop (for $\theta$) (Chib et al., 2021).
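A minimal worked version of this pipeline (hypothetical data; $k_n = 3$ basis functions) computes the exponential-tilting weights in the inner loop and profiles the ETEL criterion in the outer loop.

```python
# Sieve/ETEL sketch: tilting parameter eta in the inner loop, theta in the outer.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(4)
n, theta0 = 1_000, 1.5
Z = rng.uniform(0.0, 1.0, n)
Y = theta0 * Z + (0.5 + Z) * rng.standard_normal(n)

def g(theta):
    """Sieve moments: basis {1, z, z^2} times psi(X; theta) = Y - theta Z."""
    psi = Y - theta * Z
    return np.column_stack([psi, Z * psi, Z**2 * psi])       # (n, k_n)

def neg_etel(theta):
    """Negative ETEL profile: fit tilting parameter eta, then return -sum log p_i."""
    G = g(theta)
    inner = lambda eta: np.mean(np.exp(np.clip(G @ eta, -50, 50)))  # convex in eta
    eta = minimize(inner, np.zeros(G.shape[1]), method="BFGS").x
    logw = G @ eta
    logp = logw - np.log(np.sum(np.exp(logw)))               # log p_i, normalized
    return -np.sum(logp)

theta_hat = minimize_scalar(neg_etel, bounds=(0.0, 3.0), method="bounded").x
print(f"theta_hat = {theta_hat:.3f}")
```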

5. Key Variants and Theoretical Extensions

  • Kernel Method of Moments (KMM):

KMM replaces the divergence penalty in the GEL functional by a maximum mean discrepancy (MMD) between a candidate law and the empirical law, together with an entropy regularization term. This allows candidate distributions to place mass "off" the empirical data, yielding:

$$R_\epsilon^\varphi(\theta) = \inf_{P \ll \omega} \left\{ \tfrac{1}{2}\,\mathrm{MMD}^2(P, \hat{P}_n; \mathcal{F}) + \epsilon\, D_\varphi(P \,\|\, \omega) \right\} \quad \text{s.t. } E_P[\Psi(X,Z;\theta)] = 0$$

Dual representations, representer-theorem reductions, and practical stochastic gradient algorithms are employed. KMM achieves semiparametric efficiency and offers flexibility beyond data reweighting approaches (Kremer et al., 2023).
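The MMD term itself is straightforward to compute for discrete candidate laws. The sketch below (toy data and a particle grid of our own choosing, not the KMM algorithm itself) evaluates the squared MMD between a candidate measure supported off the observed points and the empirical measure.

```python
# Squared MMD between a weighted candidate law and the empirical law.
import numpy as np

def gauss_kernel(a, b, sigma=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / sigma**2)

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 200)                 # empirical sample, weights 1/n
particles = np.linspace(-3.0, 3.0, 50)        # candidate support, "off data"
q = np.full(50, 1.0 / 50)                     # candidate weights

def mmd2(p_pts, p_wts, data):
    """||mean embedding of (p_pts, p_wts) - mean embedding of data||^2 in the RKHS."""
    n = len(data)
    return (p_wts @ gauss_kernel(p_pts, p_pts) @ p_wts
            - 2.0 * p_wts @ gauss_kernel(p_pts, data).mean(axis=1)
            + gauss_kernel(data, data).sum() / n**2)

print(f"MMD^2(candidate, empirical) = {mmd2(particles, q, x):.4f}")
print(f"MMD^2(empirical, empirical) = {mmd2(x, np.full(len(x), 1/len(x)), x):.6f}")
```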

  • Dependent Data and Semiparametric Models:

In stationary $\alpha$-mixing settings (e.g., time series, partially linear models), EL-based inference incorporates nonparametric estimates $\widehat{\eta}_\gamma$ of nuisance functions via kernel smoothing, with Wilks' theorem holding under appropriate mixing-rate conditions (Chaumaray et al., 2020).

6. Empirical Performance and Applications

Canonical experiments demonstrate the utility of EL and GEL in CMR problems:

  • Heteroskedastic linear regression:

Both kernel- and neural-network-based FGEL methods achieve the lowest MSE for $\hat{\theta}$ across sample sizes, outperforming traditional two-step GMM and recent variational-moment estimators (Kremer et al., 2022, Kremer et al., 2023).

  • Instrumental-variable regression:

FGEL (kernel and neural variants) and KMM consistently yield lower test MSE than least squares, SMD, kernel/neural VMM, and DeepIV, in both parametric and nonparametric settings (Kremer et al., 2022, Kremer et al., 2023).

7. Comparative Properties and Extensions

Empirical-likelihood estimators under CMR combine semiparametric efficiency, optimization flexibility, and accommodation of infinite unconditional restriction sets via RKHS, sieves, or neural parameterizations. They contrast with GMM, which is strictly limited to unconditional restrictions, and (kernelized) variational moment-matching, which may lack exact constraint satisfaction or efficiency properties without substantial regularization and basis-approximation (Kremer et al., 2022, Kremer et al., 2023, Chib et al., 2021).

Summary Table: Key Methodological Variants

| Variant | Constraint Enforcement | Candidate Measure |
|---|---|---|
| EL / GEL | Data reweighting ($P \ll \hat{P}_n$) | Discrete (empirical) |
| KMM | MMD penalty + entropy-regularized moments | Law "off data" |
| ETEL / Sieve | Exponential tilting, sieve moments | Data reweighting |
| Neural-GEL | Network dual ($\lambda_\omega(z)$) | Flexible parametric |

Each method achieves the Chamberlain semiparametric efficiency bound for appropriately chosen function classes; selection of basis dimension, kernel, or network size is critical for practical performance (Kremer et al., 2022, Chib et al., 2021, Kremer et al., 2023).
