L1-Penalised Marginal Likelihood
- The L1-Penalised Marginal Likelihood Function is a regularization technique that adds an L1 penalty to the marginal likelihood, promoting sparsity in high-dimensional models.
- It employs methods like block coordinate descent, EM-type algorithms, and proximal updates to tackle non-convexity and stabilize optimization in complex latent and graphical models.
- The approach achieves boundedness and convergence, enabling simultaneous parameter estimation and variable selection with provable theoretical guarantees.
The $\ell_1$-Penalised Marginal Likelihood Function introduces sparsity into likelihood-based inference for high-dimensional statistical models by augmenting the marginal likelihood with an $\ell_1$ penalty term. This approach regularizes parameter estimates, stabilizes optimization under non-convexity, and enables simultaneous estimation and variable selection in complex latent or graphical models. The framework has prominent instantiations in finite mixture regression models, constrained marginal log-linear models, and graphical model selection, where the number of parameters or features can drastically exceed sample size.
1. Formulation and Scope
The $\ell_1$-penalised marginal likelihood, in its general form, modifies the usual log-likelihood by subtracting an $\ell_1$ norm of a parameter or transformation of a parameter vector. For a parameter vector $\theta$ and penalty weights $\alpha_j \ge 0$ as in constrained marginal models, the penalised objective is

$$\ell_{\mathrm{pen}}(\theta) \;=\; \ell(\theta) \;-\; \sum_{j} \alpha_j\,|\theta_j|,$$

or, in the lasso-analog format for uniform regularization,

$$\ell_{\mathrm{pen}}(\theta) \;=\; \ell(\theta) \;-\; \lambda\,\|\theta\|_1.$$
The marginal likelihood may encompass mixtures over latent variables, marginalization over discrete or continuous distributions, or log-partition functions in graphical models (Städler et al., 2012, Evans et al., 2011, 0707.0704).
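As a toy illustration of the lasso-analog criterion above, the following sketch evaluates a Gaussian linear-model log-likelihood with a uniform $\ell_1$ penalty (unit noise variance and the function names are illustrative assumptions, not part of any cited implementation):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Gaussian linear-model log-likelihood with unit noise variance."""
    resid = y - X @ beta
    return -0.5 * resid @ resid - 0.5 * len(y) * np.log(2 * np.pi)

def penalised_log_likelihood(beta, X, y, lam):
    """Lasso-analog criterion: log-likelihood minus lam * ||beta||_1."""
    return log_likelihood(beta, X, y) - lam * np.abs(beta).sum()
```

Maximizing this criterion over `beta` trades goodness of fit against the $\ell_1$ norm; larger `lam` pushes more coefficients to exactly zero.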
The scope of $\ell_1$-penalised marginal likelihood covers:
- Finite mixture regression (FMR) models with unknown latent membership and possibly $p \gg n$ (Städler et al., 2012)
- Marginal log-linear models for discrete data with or without exogenous covariates (Evans et al., 2011)
- Sparse Gaussian and binary graphical models via penalized precision matrices or log-partition relaxations (0707.0704)
2. Model Classes and Mathematical Structure
Mixture Regression Models
For observed responses $y_i \in \mathbb{R}$, covariates $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and latent class $r \in \{1, \dots, k\}$, the marginal likelihood is

$$p_\theta(y_i \mid x_i) \;=\; \sum_{r=1}^{k} \pi_r\,\frac{1}{\sigma_r}\,\phi\!\left(\frac{y_i - x_i^\top \beta_r}{\sigma_r}\right),$$

where $\phi$ denotes the standard normal density. Direct maximization is ill-posed due to non-convexity and singularities ($\sigma_r \to 0$). The $\ell_1$-penalised objective, with scale-invariant reparameterization $\phi_r = \beta_r/\sigma_r$, $\rho_r = \sigma_r^{-1}$, is

$$-\frac{1}{n}\sum_{i=1}^{n} \log\!\left(\sum_{r=1}^{k} \pi_r\,\frac{\rho_r}{\sqrt{2\pi}}\,\exp\!\left(-\tfrac{1}{2}\,(\rho_r y_i - x_i^\top \phi_r)^2\right)\right) \;+\; \lambda \sum_{r=1}^{k} \pi_r^{\gamma}\,\|\phi_r\|_1 .$$
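This penalised mixture criterion can be evaluated directly in the scale-invariant $(\phi_r, \rho_r)$ parameterization. The sketch below codes a plain numerical evaluation with the $\pi_r^\gamma$-weighted penalty; argument shapes and names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def fmr_pen_nll(pi, phi, rho, X, y, lam, gamma=1.0):
    """Penalised negative log-likelihood of Gaussian mixture regression,
    parameterized by mixing weights pi (k,), scaled coefficients
    phi (k, p) = beta_r / sigma_r, and inverse scales rho (k,)."""
    z = rho[None, :] * y[:, None] - X @ phi.T            # (n, k) standardized residuals
    log_comp = (np.log(pi)[None, :] + np.log(rho)[None, :]
                - 0.5 * z ** 2 - 0.5 * np.log(2 * np.pi))
    m = log_comp.max(axis=1, keepdims=True)              # stable log-sum-exp over components
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    penalty = lam * np.sum(pi ** gamma * np.abs(phi).sum(axis=1))
    return -log_mix.mean() + penalty
```

Note that the penalty acts on $\|\phi_r\|_1$, not $\|\beta_r\|_1$, which is what keeps the objective bounded as component variances shrink.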
Marginal Log-linear Models
For marginal log-linear parameters $\lambda$, the penalised criterion is

$$\ell_{\mathrm{pen}}(\lambda) \;=\; \ell(\lambda) \;-\; \sum_{j} \alpha_j\,|\lambda_j| .$$
Optimization proceeds via quadratic approximation and coordinate-wise soft thresholding.
Sparse Graphical Models
For Gaussian observations $x_i \sim \mathcal{N}(\mu, \Sigma)$, the penalised problem for the precision matrix $\Theta = \Sigma^{-1}$ is

$$\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}\; \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda\,\|\Theta\|_1,$$

where $S$ is the sample covariance and $\|\Theta\|_1$ is the entrywise $\ell_1$ norm. For binary (Ising) models, the log-partition function is relaxed using the log-determinant bound, and the dual yields the same entrywise $\ell_1$-penalty (0707.0704).
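A minimal evaluation of this objective, assuming a precomputed sample covariance `S` (an illustrative sketch, not the optimizer of 0707.0704):

```python
import numpy as np

def glasso_objective(Theta, S, lam):
    """Penalised Gaussian log-likelihood of a candidate precision matrix:
    log det(Theta) - tr(S @ Theta) - lam * sum_ij |Theta_ij|."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return -np.inf                      # outside the positive-definite cone
    return logdet - np.trace(S @ Theta) - lam * np.abs(Theta).sum()
```

The maximization itself requires a solver (block coordinate descent or a smoothed first-order method, as discussed below); the objective, however, is concave in $\Theta$ on the positive-definite cone.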
3. Optimization Algorithms
Numerical optimization of $\ell_1$-penalised marginal likelihoods confronts non-convexity, scale invariance, and high dimensionality. Key algorithmic approaches:
Block Coordinate Methods and EM-Type Algorithms
- In penalised mixture regression, the block coordinate descent generalized EM (BCD-GEM) alternates between E-steps (responsibility calculation) and M-steps (closed-form or coordinate descent updates for $\pi_r$, $\phi_r$, $\rho_r$).
- Soft-thresholding is employed in updates to enforce sparsity:

$$\hat{\phi}_{r,j} \;\leftarrow\; \frac{\operatorname{sign}(z_{r,j})\,\bigl(|z_{r,j}| - \lambda_{r,j}\bigr)_+}{c_{r,j}},$$

where $z_{r,j}$ is a subgradient (partial-residual) term and $c_{r,j} > 0$ a curvature constant, ensuring zeros in non-relevant coordinates (Städler et al., 2012).
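The soft-thresholding operator at the heart of these updates is one line; the following sketch shows the generic elementwise form:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: sign(z) * (|z| - t)_+, applied elementwise.
    This is the proximal operator of t * ||.||_1 and the source of
    the exact zeros produced by L1-penalised updates."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

Any coordinate whose (sub)gradient term falls below the threshold `t` in magnitude is mapped to exactly zero, which is how sparsity arises mechanically.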
Regression-Style and Proximal Algorithms
- Local quadratic expansion of the likelihood combined with linearization of $\lambda$ in terms of $\theta$ allows penalised updates by coordinate-wise soft-thresholding (Evans et al., 2011).
- For marginal log-linear models, the surrogate penalised objective

$$Q(\lambda) \;=\; \ell(\lambda^{(t)}) \;+\; g^\top(\lambda - \lambda^{(t)}) \;-\; \tfrac{1}{2}\,(\lambda - \lambda^{(t)})^\top H\,(\lambda - \lambda^{(t)}) \;-\; \sum_j \alpha_j\,|\lambda_j|$$

is updated via coordinate-wise soft-thresholding,

$$\lambda_j^{(t+1)} \;=\; \frac{\operatorname{sign}(u_j)\,\bigl(|u_j| - \alpha_j\bigr)_+}{H_{jj}}, \qquad u_j \;=\; g_j + H_{jj}\,\lambda_j^{(t)} - \sum_{k \neq j} H_{jk}\,\bigl(\lambda_k - \lambda_k^{(t)}\bigr).$$
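A quadratic surrogate of this kind can be maximized by cyclic coordinate-wise soft-thresholding. The sketch below assumes a gradient vector `g` (with the expansion point absorbed into it), a symmetric positive-definite curvature matrix `H`, and penalty weights `alpha` (all names hypothetical):

```python
import numpy as np

def soft(z, t):
    """Scalar soft-threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def cd_quadratic_l1(g, H, alpha, n_sweeps=200):
    """Cyclic coordinate maximization of
        g' lam - 0.5 * lam' H lam - sum_j alpha_j |lam_j|,
    with H symmetric positive definite."""
    lam = np.zeros_like(g, dtype=float)
    for _ in range(n_sweeps):
        for j in range(len(g)):
            # gradient term excluding coordinate j's own contribution
            u = g[j] - H[j] @ lam + H[j, j] * lam[j]
            lam[j] = soft(u, alpha[j]) / H[j, j]
    return lam
```

Each coordinate update solves its one-dimensional penalised problem exactly, so the surrogate value is monotonically non-decreasing across sweeps.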
Large-Scale Graphical Model Selection
- Block-coordinate descent in Gaussian graphical models iteratively solves $\ell_1$-constrained regression problems (recursive lasso) in columns of the precision matrix (0707.0704).
- Nesterov’s first-order smoothing of the $\ell_1$ norm enables efficient gradient-based algorithms, with $O(p^3)$ per-iteration and $O(p^{4.5}/\epsilon)$ overall complexity.
Algorithmic Complexity
A comparative summary of update complexities:
| Model | Algorithm | Per iteration complexity | Scaling (total) |
|---|---|---|---|
| FMR ($p \gg n$) | BCD-GEM, CD | $O(n\,\lvert A\rvert)$ for active set $A$ | Empirically fast |
| Marginal log-linear | Coordinate CD | $O(1)$ per coordinate update | Linear in $n$ for covariates |
| Graphical (Gaussian) | Block-CD, Nesterov | Block-CD: $O(p^3)$ per column sweep; Nesterov: $O(p^3)$ | Nesterov: $O(p^{4.5}/\epsilon)$ |
4. Boundedness, Non-Convexity, and Variable Selection
$\ell_1$ penalisation not only induces sparsity but also regularizes the likelihood surface, ensuring boundedness even under pathological parameter values. In finite mixture models, penalising $\|\phi_r\|_1$ in the scale-invariant parameterization $(\phi_r, \rho_r)$ prevents the objective from diverging as $\sigma_r \to 0$, a distinct necessity in non-convex marginal likelihoods (Städler et al., 2012).
Soft-thresholding across coordinates yields exact zeros in parameter estimates, thereby performing variable selection concurrently with likelihood optimization. In mixture models, the selected set

$$\hat{S}_r \;=\; \{\, j : \hat{\phi}_{r,j} \neq 0 \,\}$$

captures the relevant predictors in each mixture component, and similarly penalised log-linear models automatically select submodels with conditional/marginal independence structures (Evans et al., 2011).
A plausible implication is that the $\ell_1$-penalised marginal likelihood offers a unified approach to high-dimensional variable selection in non-convex latent variable or graphical inference.
5. Theoretical Properties and Selection Guarantees
Theoretical results underpinning $\ell_1$-penalised marginal likelihoods include:
- Boundedness: Proposition 1 (Städler et al., 2012) shows that for reasonable penalty choices, the penalised objective is bounded below, circumventing singularities.
- Convergence: Theorem 8 (Städler et al., 2012), Tseng’s theorem (Evans et al., 2011), block-coordinate methods (0707.0704): Under mild regularity, block-coordinate/proximal algorithms converge to stationary points of the penalised objective.
- Oracle Inequalities: Under restricted eigenvalue and identifiability conditions, global minimizers achieve predictable risk bounds: with high probability, for $\lambda \asymp \sqrt{\log p / n}$, the excess risk satisfies

$$\mathcal{E}(\hat{\theta}) \;=\; O\!\left(\frac{s \log p}{n}\right),$$

where $s$ is the sparsity, i.e. the number of truly nonzero coefficients (Städler et al., 2012).
- Support Recovery: Choice of $\lambda$ of order $\sqrt{\log p / n}$ ensures high-probability recovery of the true sparsity pattern in precision matrices (0707.0704).
Low-dimensional (fixed $p$) settings validate asymptotic normality and variable selection consistency via adaptive penalties (Städler et al., 2012). Selection properties for the adaptive lasso in marginal log-linear models are referenced in Evans’s thesis (Evans et al., 2011).
6. Practical Implementation and Tuning Strategies
Practical application of $\ell_1$-penalised marginal likelihood requires selection of tuning parameters, initialization schemes, and efficient iteration termination:
- Tuning $\lambda$: Modified BIC,

$$\mathrm{BIC}(\lambda) \;=\; -2\,\ell(\hat{\theta}_\lambda) \;+\; \log(n)\, d_\lambda,$$

where $d_\lambda$ is the effective parameter count (number of nonzero estimated coefficients), and cross-validation on held-out likelihood is routinely employed (Städler et al., 2012, Evans et al., 2011).
- $\lambda$-Grid: from $\lambda_{\max}$ down to near zero on a grid (Städler et al., 2012).
- Initialization: Random soft assignment to latent class probabilities, zeroed parameters, equi-probable mixtures, or regression starting values.
- Active-Set Strategy: On most iterations, update only coordinates with nonzero current estimates, reverting periodically to full updates, for speedup (Städler et al., 2012).
- Convergence Monitoring: Thresholds on relative change of the objective and parameter norms.
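Several of these strategies combine naturally in one illustrative sketch: active-set coordinate descent for a lasso-type criterion, convergence monitoring on relative parameter change, and modified-BIC selection over a $\lambda$-grid. This is a toy Gaussian setting with invented names, not the cited papers' code:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, full_every=10, tol=1e-6, max_sweeps=500):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1 with an
    active-set strategy: most sweeps touch only currently-nonzero
    coordinates; every `full_every`-th sweep scans all coordinates so new
    variables can enter. Stops on small relative parameter change."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for sweep in range(max_sweeps):
        old = beta.copy()
        idx = range(p) if sweep % full_every == 0 else np.flatnonzero(beta)
        for j in idx:
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            beta[j] = soft(X[:, j] @ r, lam) / col_sq[j]
        if np.linalg.norm(beta - old) <= tol * (1.0 + np.linalg.norm(old)):
            break
    return beta

def modified_bic(loglik, d_eff, n):
    """Modified BIC: -2*loglik + log(n)*d_eff, with d_eff the number of
    nonzero estimated coefficients (effective parameter count)."""
    return -2.0 * loglik + np.log(n) * d_eff

def select_lambda(X, y, lam_grid):
    """Fit over a lambda grid and pick the modified-BIC minimizer
    (Gaussian log-likelihood with unit variance, for illustration)."""
    best = None
    for lam in lam_grid:
        beta = lasso_cd(X, y, lam)
        resid = y - X @ beta
        ll = -0.5 * resid @ resid - 0.5 * len(y) * np.log(2 * np.pi)
        bic = modified_bic(ll, int(np.count_nonzero(beta)), len(y))
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best
```

The grid would typically run from a data-driven $\lambda_{\max}$ (where all coefficients are zero) down toward zero, as in the tuning strategies above.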
Complex models (e.g., individual covariates in log-linear models) are handled efficiently by regression-style algorithms, whereas constrained (Lagrangian) methods are computationally prohibitive even for modest sample sizes (Evans et al., 2011).
7. Connections and Comparative Perspectives
$\ell_1$-penalised marginal likelihood generalizes the lasso ($\ell_1$-penalised least squares) to settings where the likelihood includes latent variables, marginalization, or partition functions, inducing non-convexity. In mixture models, convexity is lost due to the log-sum-exp structure, distinguishing these applications from convex penalised regression.
Graphical model selection via $\ell_1$-penalised maximum likelihood in Gaussian and Ising models employs convex formulations (0707.0704) but faces challenges in scaling to large $p$; first-order methods and block coordinate descent provide tractable alternatives to interior point methods. For discrete data, marginal log-linear models fit via $\ell_1$-penalised quadratic approximation and coordinate descent yield sparse submodels and scale to large sample size and covariate dimension.
The method’s ability to induce exact zeros, prevent overfitting under limited data, and enable interpretable high-dimensional modeling links it to broader regularization and model selection frameworks in statistical learning.
References:
- Städler, Bühlmann, van de Geer (2010), “$\ell_1$-Penalization for Mixture Regression Models,” (Städler et al., 2012)
- Evans, Forcina (2011), “Two algorithms for fitting constrained marginal models,” (Evans et al., 2011)
- Banerjee, El Ghaoui, d’Aspremont (2008), “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data,” (0707.0704)