L1-Penalised Marginal Likelihood

Updated 31 January 2026
  • The L1-Penalised Marginal Likelihood Function is a regularization technique that adds an L1 penalty to the marginal likelihood, promoting sparsity in high-dimensional models.
  • It employs methods like block coordinate descent, EM-type algorithms, and proximal updates to tackle non-convexity and stabilize optimization in complex latent and graphical models.
  • The approach achieves boundedness and convergence, enabling simultaneous parameter estimation and variable selection with provable theoretical guarantees.

The $L_1$-Penalised Marginal Likelihood Function introduces sparsity into likelihood-based inference for high-dimensional statistical models by augmenting the marginal likelihood with an $L_1$ penalty term. This approach regularizes parameter estimates, stabilizes optimization under non-convexity, and enables simultaneous estimation and variable selection in complex latent or graphical models. The framework has prominent instantiations in finite mixture regression models, constrained marginal log-linear models, and graphical model selection, where the number of parameters or features can drastically exceed the sample size.

1. Formulation and Scope

The $L_1$-penalised marginal likelihood, in its general form, modifies the usual log-likelihood $l(\theta)$ by subtracting an $L_1$ norm of a parameter vector or a transformation of it. For a parameterization $\eta = \eta(\theta)$ and penalty vector $\nu = (\nu_1, \ldots, \nu_{t-1}) \geq 0$, as in constrained marginal models, the penalised objective is

$$\phi(\theta) = l(\theta) - \sum_{j=1}^{t-1} \nu_j\,|\eta_j(\theta)|$$

or, in the lasso-analog format for uniform regularization,

$$\phi(\theta) = l(\theta) - \lambda\,\|\eta(\theta)\|_1$$

The marginal likelihood may encompass mixtures over latent variables, marginalization over discrete or continuous distributions, or log-partition functions in graphical models (Städler et al., 2012, Evans et al., 2011, 0707.0704).

The scope of the $L_1$-penalised marginal likelihood covers:

  • Finite mixture regression (FMR) models with unknown latent membership and possibly $p \gg n$ (Städler et al., 2012)
  • Marginal log-linear models for discrete data with or without exogenous covariates (Evans et al., 2011)
  • Sparse Gaussian and binary graphical models via penalised precision matrices or log-partition relaxations (0707.0704)

2. Model Classes and Mathematical Structure

Mixture Regression Models

For observed $(Y_i, X_i)$ with $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^p$, and latent class $Z_i \in \{1, \ldots, K\}$, the marginal log-likelihood is

$$L(\Theta) = \sum_{i=1}^n \log \left( \sum_{r=1}^{K} \pi_r \frac{1}{\sqrt{2\pi}\,\sigma_r} \exp\Big\{ -\frac{(y_i - x_i^\top \beta_r)^2}{2\sigma_r^2} \Big\} \right)$$

Direct maximization is ill-posed due to non-convexity and singularities ($\sigma_r \to 0$). The $L_1$-penalised objective, with scale-invariant reparameterization $\phi_r = \beta_r/\sigma_r$, $\rho_r = 1/\sigma_r$, is

$$-\frac{1}{n}\,\ell_{\rm pen}(\Theta) = -\frac{1}{n} \sum_{i=1}^n \log h_\Theta(y_i \mid x_i) + \lambda \sum_{r=1}^K \|\phi_r\|_1$$
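As a concrete illustration, the penalised criterion above can be evaluated directly. The following Python sketch (with assumed array shapes; not code from the cited paper) computes $-\frac{1}{n}\ell_{\rm pen}(\Theta)$ in the $(\phi, \rho)$ parameterization, where each component density becomes $\frac{\rho_r}{\sqrt{2\pi}} \exp\{-(\rho_r y_i - x_i^\top \phi_r)^2/2\}$:

```python
import numpy as np

def fmr_penalised_negloglik(y, X, pi, phi, rho, lam):
    """Penalised negative log-likelihood (1/n scale) for a K-component
    Gaussian mixture regression in the scale-invariant parameterization
    phi_r = beta_r / sigma_r, rho_r = 1 / sigma_r.

    Assumed shapes: y (n,), X (n, p), pi (K,), phi (K, p), rho (K,).
    """
    n = len(y)
    # log component terms: log(pi_r) + log(rho_r) - 0.5*log(2*pi)
    #                      - 0.5 * (rho_r * y_i - x_i^T phi_r)^2
    resid = rho[None, :] * y[:, None] - X @ phi.T           # (n, K)
    logcomp = (np.log(pi) + np.log(rho) - 0.5 * np.log(2 * np.pi)
               - 0.5 * resid**2)                            # (n, K)
    # log-sum-exp over components for numerical stability
    m = logcomp.max(axis=1, keepdims=True)
    loglik = (m.squeeze(1) + np.log(np.exp(logcomp - m).sum(axis=1))).sum()
    return -loglik / n + lam * np.abs(phi).sum()
```

With $K = 1$, $\rho_1 = 1$, $\phi_1 = 0$, and $\lambda = 0$, this reduces to the negative log-likelihood of a standard normal, which serves as a quick sanity check.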

Marginal Log-linear Models

For marginal log-linear parameters $\eta$, the penalised criterion is

$$\phi(\theta) = l(\theta) - \lambda\,\|\eta(\theta)\|_1$$

Optimization proceeds via quadratic approximation and coordinate-wise soft thresholding.

Sparse Graphical Models

For $n$ Gaussian observations $y^{(k)} \in \mathbb{R}^p$, the penalised problem for the precision matrix $\Theta$ is

$$\hat\Theta = \arg\max_{\Theta \succ 0} \left\{ \log\det \Theta - \mathrm{trace}(S\Theta) - \lambda \|\Theta\|_1 \right\}$$

where $S$ is the sample covariance. For binary (Ising) models, the log-partition function is relaxed using the log-determinant bound, and the dual yields the same entrywise $L_1$ penalty (0707.0704).
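To make the objective concrete, the following sketch evaluates the penalised log-determinant criterion for a candidate precision matrix (a minimal evaluation routine, not the solver of 0707.0704):

```python
import numpy as np

def penalised_logdet_objective(Theta, S, lam):
    """Value of log det(Theta) - trace(S @ Theta) - lam * ||Theta||_1,
    with ||.||_1 the entrywise L1 norm. Theta is assumed symmetric;
    positive definiteness is checked via its eigenvalues.
    """
    w = np.linalg.eigvalsh(Theta)
    if w.min() <= 0:
        return -np.inf  # outside the feasible cone Theta > 0
    return np.log(w).sum() - np.trace(S @ Theta) - lam * np.abs(Theta).sum()
```

Evaluating at $\Theta = I$ with $S = I$ gives $0 - p - \lambda p$, which is easy to verify by hand for small $p$.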

3. Optimization Algorithms

Numerical optimization of $L_1$-penalised marginal likelihoods confronts non-convexity, scale invariance, and high dimensionality. Key algorithmic approaches:

Block Coordinate Methods and EM-Type Algorithms

  • In penalised mixture regression, block coordinate descent generalized EM (BCD-GEM) alternates between E-steps (responsibility calculation) and M-steps (closed-form or coordinate descent updates for $\pi$, $\phi_r$, $\rho_r$).
  • Soft-thresholding is employed in the $\phi_{r,j}$ updates to enforce sparsity:

$$\phi_{r,j} \leftarrow \mathrm{sgn}(S_j)\, \frac{\max\{|S_j| - n\lambda\,(\pi_r^{(t+1)})^\gamma,\ 0\}}{\|\tilde X_j\|^2}$$

where $S_j$ is a subgradient term, ensuring zeros in non-relevant coordinates (Städler et al., 2012).
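A schematic implementation of this coordinate update, assuming the subgradient term $S_j$ and the weighted column norm $\|\tilde X_j\|^2$ have already been computed from the current responsibilities (the full M-step bookkeeping is omitted):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def update_phi_coordinate(S_j, X_tilde_j_sq, n, lam, pi_r, gamma=1.0):
    """One BCD-GEM M-step coordinate update:
    phi_{r,j} <- sgn(S_j) * max(|S_j| - n*lam*pi_r^gamma, 0) / ||X~_j||^2.
    A schematic sketch of the update formula, not the full M-step.
    """
    return soft_threshold(S_j, n * lam * pi_r**gamma) / X_tilde_j_sq
```

Whenever $|S_j|$ falls below the threshold $n\lambda(\pi_r^{(t+1)})^\gamma$, the coordinate is set exactly to zero, which is how the algorithm performs variable selection within the EM iterations.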

Regression-Style and Proximal Algorithms

  • Local quadratic expansion of the likelihood, combined with linearization of $\theta$ in terms of $\eta$, allows penalised updates by coordinate-wise soft-thresholding (Evans et al., 2011).
  • For marginal log-linear models, the surrogate penalised objective

$$\tilde{\phi}(\eta) = Q(\eta) - \sum_j \nu_j\,|\eta_j|$$

is updated via

$$\eta_j \leftarrow \mathrm{sign}(\check\eta_j)\,\big(|\check\eta_j| - \nu_j\big)_+$$
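This update is an elementwise soft-thresholding of the unpenalised surrogate optimum $\check\eta$; a minimal sketch, assuming unit curvature in the quadratic surrogate $Q$:

```python
import numpy as np

def prox_update(eta_check, nu):
    """Penalised update for the surrogate objective: each coordinate of
    the unpenalised optimum eta_check is shrunk toward zero by its own
    penalty weight, eta_j = sign(eta_check_j) * (|eta_check_j| - nu_j)_+.
    """
    return np.sign(eta_check) * np.maximum(np.abs(eta_check) - nu, 0.0)
```

Coordinates whose unpenalised optimum is smaller in magnitude than their penalty weight are mapped exactly to zero, which is what produces sparse submodels.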

Large-Scale Graphical Model Selection

  • Block-coordinate descent in Gaussian graphical models iteratively solves $L_1$-constrained regression problems (recursive lasso) over the columns of the precision matrix (0707.0704).
  • Nesterov’s first-order smoothing of the $L_1$ norm enables efficient gradient-based algorithms, with $O(p^3)$ per-iteration cost and $O(p^{4.5}/\epsilon)$ overall complexity.
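Nesterov’s smoothing replaces each $|t|$ by $\max_{|u| \le 1}(ut - \tfrac{\mu}{2}u^2)$, giving a differentiable surrogate with $1/\mu$-Lipschitz gradient and uniform error at most $\mu/2$ per coordinate. A sketch of the smoothed norm and its gradient (a generic construction, not the specific solver of 0707.0704):

```python
import numpy as np

def smoothed_l1(x, mu):
    """Nesterov smoothing of ||x||_1 with parameter mu > 0:
    |t| is replaced by t^2/(2*mu) for |t| <= mu, else |t| - mu/2.
    Satisfies 0 <= ||x||_1 - f_mu(x) <= mu * len(x) / 2.
    """
    a = np.abs(x)
    return np.where(a <= mu, a**2 / (2 * mu), a - mu / 2).sum()

def smoothed_l1_grad(x, mu):
    """Gradient of the smoothed L1 norm: clip(x/mu, -1, 1);
    Lipschitz constant 1/mu, enabling accelerated gradient methods."""
    return np.clip(x / mu, -1.0, 1.0)
```

Decreasing $\mu$ tightens the approximation at the cost of a larger gradient Lipschitz constant, which is the source of the $1/\epsilon$ factor in the overall complexity.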

Algorithmic Complexity

A comparative summary of update complexities:

| Model | Algorithm | Per-iteration complexity | Total scaling |
| --- | --- | --- | --- |
| FMR ($p \gg n$) | BCD-GEM, CD | $O(pK)$ for active set | Empirically fast |
| Marginal log-linear | Coordinate descent | $O((t-1)^2)$ per update | Linear in $n$ for covariates |
| Graphical (Gaussian) | Block-CD, Nesterov | $O(p^3)$ | Block-CD: $O(Kp^4)$; Nesterov: $O(p^{4.5}/\epsilon)$ |

4. Boundedness, Non-Convexity, and Variable Selection

$L_1$ penalisation not only induces sparsity but also regularizes the likelihood surface, ensuring boundedness even at pathological parameter values. In finite mixture models, penalising the scale-invariant parameters $\phi_r$ prevents the objective from diverging as $\sigma_r \to 0$, a distinct necessity in non-convex marginal likelihoods (Städler et al., 2012).

Soft-thresholding across coordinates yields exact zeros in parameter estimates, thereby performing variable selection concurrent with likelihood optimization. In mixture models, the selected set

$$\hat S = \{(r, j) : \hat\phi_{r,j} \neq 0\}$$

captures the relevant predictors in each mixture component; similarly, penalised log-linear models automatically select submodels with conditional/marginal independence structures (Evans et al., 2011).
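Extracting the selected set from a fitted coefficient matrix is straightforward; the `tol` argument below is a hypothetical convenience for treating numerically tiny estimates as zero:

```python
import numpy as np

def selected_set(phi_hat, tol=0.0):
    """Active set S_hat = {(r, j) : phi_hat[r, j] != 0} from an estimated
    (K, p) coefficient matrix; entries with |value| <= tol count as zero."""
    r_idx, j_idx = np.nonzero(np.abs(phi_hat) > tol)
    return set(zip(r_idx.tolist(), j_idx.tolist()))
```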

A plausible implication is that the $L_1$-penalised marginal likelihood forms a unified solution to high-dimensional variable selection in non-convex latent-variable or graphical inference.

5. Theoretical Properties and Selection Guarantees

Theoretical results underpinning $L_1$-penalised marginal likelihoods include:

  • Boundedness: Proposition 1 (Städler et al., 2012) shows that for reasonable penalty choices, the penalised objective is bounded below, circumventing singularities.
  • Convergence: Theorem 8 (Städler et al., 2012), Tseng’s theorem (Evans et al., 2011), block-coordinate methods (0707.0704): Under mild regularity, block-coordinate/proximal algorithms converge to stationary points of the penalised objective.
  • Oracle Inequalities: Under restricted eigenvalue and identifiability conditions, global minimizers achieve explicit risk bounds:

$$\mathrm{KL\text{-}risk}(\hat\Theta) = O_P\!\left( s\,\frac{\log^3 n\,\log p}{n} \right)$$

where $s$ is the sparsity (Städler et al., 2012).

  • Support Recovery: Choosing $\lambda$ of order $O(\sqrt{\log p / n})$ ensures high-probability recovery of the true sparsity pattern in precision matrices (0707.0704).

In low-dimensional (fixed $p$, $K$) settings, asymptotic normality and variable-selection consistency hold via adaptive penalties (Städler et al., 2012). Selection properties of the adaptive lasso in marginal log-linear models are referenced in Evans’s thesis (Evans et al., 2011).

6. Practical Implementation and Tuning Strategies

Practical application of the $L_1$-penalised marginal likelihood requires selection of tuning parameters, initialization schemes, and efficient iteration termination:

  • Tuning $\lambda$: Modified BIC,

$$\mathrm{BIC} = -2\,\ell(\hat\Theta) + \log n \cdot d_e$$

where $d_e$ is the effective parameter count; cross-validation on held-out likelihood is also routinely employed (Städler et al., 2012, Evans et al., 2011).

  • $\lambda$-Grid: From $\lambda_{\max} = \max_j |\langle Y, X_{\cdot j}\rangle| / (\sqrt{n}\,\|Y\|)$ down to a near-zero grid (Städler et al., 2012).
  • Initialization: Random soft assignment to latent class probabilities, zeroed parameters, equi-probable mixtures, or regression starting values.
  • Active-Set Strategy: Update only the coordinates with nonzero current estimates in most iterations, resorting to periodic full sweeps, for a substantial speedup (Städler et al., 2012).
  • Convergence Monitoring: Thresholds on relative change (e.g., $10^{-6}$) in the objective and parameter norms.
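The tuning recipe above can be sketched as follows: `lambda_grid` uses the $\lambda_{\max}$ formula from the text (assuming centred design columns), and `modified_bic` scores a fit by its effective nonzero-parameter count $d_e$. Function names and the log-spaced grid layout are illustrative choices, not prescribed by the cited papers:

```python
import numpy as np

def lambda_grid(X, y, n_grid=20, eps=1e-3):
    """Log-spaced grid of penalty values from lambda_max down to
    eps * lambda_max, with
    lambda_max = max_j |<Y, X_j>| / (sqrt(n) * ||Y||)."""
    n = len(y)
    lam_max = np.max(np.abs(X.T @ y)) / (np.sqrt(n) * np.linalg.norm(y))
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_grid)

def modified_bic(loglik, n, d_e):
    """Modified BIC = -2 * loglik + log(n) * d_e, where d_e counts the
    effective (nonzero) parameters of the fitted model."""
    return -2.0 * loglik + np.log(n) * d_e
```

In practice one fits the model along the grid (warm-starting each fit from the previous one) and selects the $\lambda$ minimizing the BIC or the cross-validated held-out negative log-likelihood.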

Complex models (e.g., log-linear models with individual covariates) are handled efficiently by regression-style algorithms, whereas constrained (Lagrangian) methods are computationally prohibitive even for modest sample sizes (Evans et al., 2011).

7. Connections and Comparative Perspectives

The $L_1$-penalised marginal likelihood generalizes the lasso ($L_1$-penalised least squares) to settings where the likelihood includes latent variables, marginalization, or partition functions, inducing non-convexity. In mixture models, convexity is lost due to the log-sum-exp structure, distinguishing these applications from convex penalised regression.

Graphical model selection via $L_1$-penalised maximum likelihood in Gaussian and Ising models employs convex formulations (0707.0704) but faces challenges in scaling to large $p$; first-order methods and block coordinate descent provide tractable alternatives to interior-point methods. For discrete data, marginal log-linear models fit via $L_1$-penalised quadratic approximation and coordinate descent yield sparse submodels and scale to large sample sizes and covariate dimensions.

The method’s ability to induce exact zeros, prevent overfitting under limited data, and enable interpretable high-dimensional modeling links it to broader regularization and model selection frameworks in statistical learning.

References:

  • Städler, Bühlmann, van de Geer (2010), “$L_1$-Penalization for Mixture Regression Models,” (Städler et al., 2012)
  • Bergsma, Rudas (2011), “Two algorithms for fitting constrained marginal models,” (Evans et al., 2011)
  • Banerjee, El Ghaoui, d’Aspremont (2008), “Model Selection Through Sparse Maximum Likelihood Estimation,” (0707.0704)