L1-Penalised Marginal Likelihood
- The L1-Penalised Marginal Likelihood Function is a regularization technique that adds an L1 penalty to the marginal likelihood, promoting sparsity in high-dimensional models.
- It employs methods like block coordinate descent, EM-type algorithms, and proximal updates to tackle non-convexity and stabilize optimization in complex latent and graphical models.
- The approach achieves boundedness and convergence, enabling simultaneous parameter estimation and variable selection with provable theoretical guarantees.
The $\ell_1$-Penalised Marginal Likelihood Function introduces sparsity into likelihood-based inference for high-dimensional statistical models by augmenting the marginal likelihood with an $\ell_1$ penalty term. This approach regularizes parameter estimates, stabilizes optimization under non-convexity, and enables simultaneous estimation and variable selection in complex latent or graphical models. The framework has prominent instantiations in finite mixture regression models, constrained marginal log-linear models, and graphical model selection, where the number of parameters or features can drastically exceed sample size.
1. Formulation and Scope
The $\ell_1$-penalised marginal likelihood, in its general form, modifies the usual log-likelihood by subtracting an $\ell_1$ norm of a parameter or transformation of a parameter vector. For a parameter vector $\theta$ and penalty weights $\alpha_j \ge 0$ as in constrained marginal models, the penalised objective is

$$\ell_{\mathrm{pen}}(\theta) \;=\; \ell(\theta) \;-\; \sum_{j} \alpha_j\,|\theta_j|,$$

or, in the lasso-analog format for uniform regularization,

$$\ell_{\mathrm{pen}}(\theta) \;=\; \ell(\theta) \;-\; \lambda\,\|\theta\|_1.$$
The marginal likelihood may encompass mixtures over latent variables, marginalization over discrete or continuous distributions, or log-partition functions in graphical models (Städler et al., 2012, Evans et al., 2011, 0707.0704).
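As a toy illustration of the lasso-analog criterion above, the following sketch evaluates a Gaussian linear-model log-likelihood with a uniform $\ell_1$ penalty (unit noise variance and the function names are illustrative assumptions, not part of any cited implementation):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Gaussian linear-model log-likelihood with unit noise variance."""
    resid = y - X @ beta
    return -0.5 * resid @ resid - 0.5 * len(y) * np.log(2 * np.pi)

def penalised_log_likelihood(beta, X, y, lam):
    """Lasso-analog criterion: log-likelihood minus lam * ||beta||_1."""
    return log_likelihood(beta, X, y) - lam * np.abs(beta).sum()
```

Maximizing this criterion over `beta` trades goodness of fit against the $\ell_1$ norm; larger `lam` pushes more coefficients to exactly zero.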
The scope of $\ell_1$-penalised marginal likelihood covers:
- Finite mixture regression (FMR) models with unknown latent membership and possibly $p \gg n$ (Städler et al., 2012)
- Marginal log-linear models for discrete data with or without exogenous covariates (Evans et al., 2011)
- Sparse Gaussian and binary graphical models via penalized precision matrices or log-partition relaxations (0707.0704)
2. Model Classes and Mathematical Structure
Mixture Regression Models
For observed responses $y_i \in \mathbb{R}$, covariates $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and latent class $r \in \{1, \dots, k\}$, the marginal likelihood is

$$p_\theta(y_i \mid x_i) \;=\; \sum_{r=1}^{k} \pi_r\,\frac{1}{\sigma_r}\,\phi\!\left(\frac{y_i - x_i^\top \beta_r}{\sigma_r}\right),$$

where $\phi$ denotes the standard normal density. Direct maximization is ill-posed due to non-convexity and singularities ($\sigma_r \to 0$). The $\ell_1$-penalised objective, with scale-invariant reparameterization $\phi_r = \beta_r/\sigma_r$, $\rho_r = \sigma_r^{-1}$, is

$$-\frac{1}{n}\sum_{i=1}^{n} \log\!\left(\sum_{r=1}^{k} \pi_r\,\frac{\rho_r}{\sqrt{2\pi}}\,\exp\!\left(-\tfrac{1}{2}\,(\rho_r y_i - x_i^\top \phi_r)^2\right)\right) \;+\; \lambda \sum_{r=1}^{k} \pi_r^{\gamma}\,\|\phi_r\|_1 .$$
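This penalised mixture criterion can be evaluated directly in the scale-invariant $(\phi_r, \rho_r)$ parameterization. The sketch below codes a plain numerical evaluation with the $\pi_r^\gamma$-weighted penalty; argument shapes and names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def fmr_pen_nll(pi, phi, rho, X, y, lam, gamma=1.0):
    """Penalised negative log-likelihood of Gaussian mixture regression,
    parameterized by mixing weights pi (k,), scaled coefficients
    phi (k, p) = beta_r / sigma_r, and inverse scales rho (k,)."""
    z = rho[None, :] * y[:, None] - X @ phi.T            # (n, k) standardized residuals
    log_comp = (np.log(pi)[None, :] + np.log(rho)[None, :]
                - 0.5 * z ** 2 - 0.5 * np.log(2 * np.pi))
    m = log_comp.max(axis=1, keepdims=True)              # stable log-sum-exp over components
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    penalty = lam * np.sum(pi ** gamma * np.abs(phi).sum(axis=1))
    return -log_mix.mean() + penalty
```

Note that the penalty acts on $\|\phi_r\|_1$, not $\|\beta_r\|_1$, which is what keeps the objective bounded as component variances shrink.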
Marginal Log-linear Models
For marginal log-linear parameters $\lambda$, the penalised criterion is

$$\ell_{\mathrm{pen}}(\lambda) \;=\; \ell(\lambda) \;-\; \sum_{j} \alpha_j\,|\lambda_j| .$$
Optimization proceeds via quadratic approximation and coordinate-wise soft thresholding.
Sparse Graphical Models
For Gaussian observations $x_i \sim \mathcal{N}(\mu, \Sigma)$, the penalised problem for the precision matrix $\Theta = \Sigma^{-1}$ is

$$\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}\; \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda\,\|\Theta\|_1,$$

where $S$ is the sample covariance and $\|\Theta\|_1$ is the entrywise $\ell_1$ norm. For binary (Ising) models, the log-partition function is relaxed using the log-determinant bound, and the dual yields the same entrywise $\ell_1$-penalty (0707.0704).
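A minimal evaluation of this objective, assuming a precomputed sample covariance `S` (an illustrative sketch, not the optimizer of 0707.0704):

```python
import numpy as np

def glasso_objective(Theta, S, lam):
    """Penalised Gaussian log-likelihood of a candidate precision matrix:
    log det(Theta) - tr(S @ Theta) - lam * sum_ij |Theta_ij|."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return -np.inf                      # outside the positive-definite cone
    return logdet - np.trace(S @ Theta) - lam * np.abs(Theta).sum()
```

The maximization itself requires a solver (block coordinate descent or a smoothed first-order method, as discussed below); the objective, however, is concave in $\Theta$ on the positive-definite cone.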
3. Optimization Algorithms
Numerical optimization of $\ell_1$-penalised marginal likelihoods confronts non-convexity, scale invariance, and high dimensionality. Key algorithmic approaches:
Block Coordinate Methods and EM-Type Algorithms
- In penalised mixture regression, the block coordinate descent generalized EM (BCD-GEM) alternates between E-steps (responsibility calculation) and M-steps (closed-form or coordinate descent updates for $\pi_r$, $\phi_r$, $\rho_r$).
- Soft-thresholding is employed in updates to enforce sparsity:

$$\hat{\phi}_{r,j} \;\leftarrow\; \frac{\operatorname{sign}(z_{r,j})\,\bigl(|z_{r,j}| - \lambda_{r,j}\bigr)_+}{c_{r,j}},$$

where $z_{r,j}$ is a subgradient (partial-residual) term and $c_{r,j} > 0$ a curvature constant, ensuring zeros in non-relevant coordinates (Städler et al., 2012).
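The soft-thresholding operator at the heart of these updates is one line; the following sketch shows the generic elementwise form:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: sign(z) * (|z| - t)_+, applied elementwise.
    This is the proximal operator of t * ||.||_1 and the source of
    the exact zeros produced by L1-penalised updates."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

Any coordinate whose (sub)gradient term falls below the threshold `t` in magnitude is mapped to exactly zero, which is how sparsity arises mechanically.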
Regression-Style and Proximal Algorithms
- Local quadratic expansion of the likelihood combined with linearization of $\lambda$ in terms of $\theta$ allows penalised updates by coordinate-wise soft-thresholding (Evans et al., 2011).
- For marginal log-linear models, the surrogate penalised objective

$$Q(\lambda) \;=\; \ell(\lambda^{(t)}) \;+\; g^\top(\lambda - \lambda^{(t)}) \;-\; \tfrac{1}{2}\,(\lambda - \lambda^{(t)})^\top H\,(\lambda - \lambda^{(t)}) \;-\; \sum_j \alpha_j\,|\lambda_j|$$

is updated via coordinate-wise soft-thresholding,

$$\lambda_j^{(t+1)} \;=\; \frac{\operatorname{sign}(u_j)\,\bigl(|u_j| - \alpha_j\bigr)_+}{H_{jj}}, \qquad u_j \;=\; g_j + H_{jj}\,\lambda_j^{(t)} - \sum_{k \neq j} H_{jk}\,\bigl(\lambda_k - \lambda_k^{(t)}\bigr).$$
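A quadratic surrogate of this kind can be maximized by cyclic coordinate-wise soft-thresholding. The sketch below assumes a gradient vector `g` (with the expansion point absorbed into it), a symmetric positive-definite curvature matrix `H`, and penalty weights `alpha` (all names hypothetical):

```python
import numpy as np

def soft(z, t):
    """Scalar soft-threshold: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def cd_quadratic_l1(g, H, alpha, n_sweeps=200):
    """Cyclic coordinate maximization of
        g' lam - 0.5 * lam' H lam - sum_j alpha_j |lam_j|,
    with H symmetric positive definite."""
    lam = np.zeros_like(g, dtype=float)
    for _ in range(n_sweeps):
        for j in range(len(g)):
            # gradient term excluding coordinate j's own contribution
            u = g[j] - H[j] @ lam + H[j, j] * lam[j]
            lam[j] = soft(u, alpha[j]) / H[j, j]
    return lam
```

Each coordinate update solves its one-dimensional penalised problem exactly, so the surrogate value is monotonically non-decreasing across sweeps.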
Large-Scale Graphical Model Selection
- Block-coordinate descent in Gaussian graphical models iteratively solves $\ell_1$-constrained regression problems (recursive lasso) in columns of the precision matrix (0707.0704).
- Nesterov’s first-order smoothing of the $\ell_1$ norm enables efficient gradient-based algorithms, with $O(p^3)$ per-iteration and $O(p^{4.5}/\epsilon)$ overall complexity.
Algorithmic Complexity
A comparative summary of update complexities:
| Model | Algorithm | Per iteration complexity | Scaling (total) |
|---|---|---|---|
| FMR ($p \gg n$) | BCD-GEM, CD | $O(n\,\lvert A\rvert)$ for active set $A$ | Empirically fast |
| Marginal log-linear | Coordinate CD | $O(1)$ per coordinate update | Linear in $n$ for covariates |
| Graphical (Gaussian) | Block-CD, Nesterov | Block-CD: $O(p^3)$ per column sweep; Nesterov: $O(p^3)$ | Nesterov: $O(p^{4.5}/\epsilon)$ |
4. Boundedness, Non-Convexity, and Variable Selection
$\ell_1$ penalisation not only induces sparsity but also regularizes the likelihood surface, ensuring boundedness even under pathological parameter values. In finite mixture models, penalising $\|\phi_r\|_1$ in the scale-invariant parameterization $(\phi_r, \rho_r)$ prevents the objective from diverging as $\sigma_r \to 0$, a distinct necessity in non-convex marginal likelihoods (Städler et al., 2012).
Soft-thresholding across coordinates yields exact zeros in parameter estimates, thereby performing variable selection concurrently with likelihood optimization. In mixture models, the selected set

$$\hat{S}_r \;=\; \{\, j : \hat{\phi}_{r,j} \neq 0 \,\}$$

captures the relevant predictors in each mixture component, and similarly penalised log-linear models automatically select submodels with conditional/marginal independence structures (Evans et al., 2011).
A plausible implication is that the $\ell_1$-penalised marginal likelihood offers a unified approach to high-dimensional variable selection in non-convex latent variable or graphical inference.
5. Theoretical Properties and Selection Guarantees
Theoretical results underpinning $\ell_1$-penalised marginal likelihoods include:
- Boundedness: Proposition 1 (Städler et al., 2012) shows that for reasonable penalty choices, the penalised objective is bounded below, circumventing singularities.
- Convergence: Theorem 8 (Städler et al., 2012), Tseng’s theorem (Evans et al., 2011), block-coordinate methods (0707.0704): Under mild regularity, block-coordinate/proximal algorithms converge to stationary points of the penalised objective.
- Oracle Inequalities: Under restricted eigenvalue and identifiability conditions, global minimizers achieve predictable risk bounds: with high probability, for $\lambda \asymp \sqrt{\log p / n}$, the excess risk satisfies

$$\mathcal{E}(\hat{\theta}) \;=\; O\!\left(\frac{s \log p}{n}\right),$$

where $s$ is the sparsity, i.e. the number of truly nonzero coefficients (Städler et al., 2012).
- Support Recovery: Choice of $\lambda$ of order $\sqrt{\log p / n}$ ensures high-probability recovery of the true sparsity pattern in precision matrices (0707.0704).
Low-dimensional (fixed $p$) settings validate asymptotic normality and variable selection consistency via adaptive penalties (Städler et al., 2012). Selection properties for the adaptive lasso in marginal log-linear models are referenced in Evans’s thesis (Evans et al., 2011).
6. Practical Implementation and Tuning Strategies
Practical application of $\ell_1$-penalised marginal likelihood requires selection of tuning parameters, initialization schemes, and efficient iteration termination:
- Tuning $\lambda$: Modified BIC,

$$\mathrm{BIC}(\lambda) \;=\; -2\,\ell(\hat{\theta}_\lambda) \;+\; \log(n)\, d_\lambda,$$

where $d_\lambda$ is the effective parameter count (number of nonzero estimated coefficients), and cross-validation on held-out likelihood is routinely employed (Städler et al., 2012, Evans et al., 2011).
- $\lambda$-Grid: from $\lambda_{\max}$ down to near zero on a grid (Städler et al., 2012).
- Initialization: Random soft assignment to latent class probabilities, zeroed parameters, equi-probable mixtures, or regression starting values.
- Active-Set Strategy: On most iterations, update only coordinates with nonzero current estimates, reverting periodically to full updates, for speedup (Städler et al., 2012).
- Convergence Monitoring: Thresholds on relative change of the objective and parameter norms.
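Several of these strategies combine naturally in one illustrative sketch: active-set coordinate descent for a lasso-type criterion, convergence monitoring on relative parameter change, and modified-BIC selection over a $\lambda$-grid. This is a toy Gaussian setting with invented names, not the cited papers' code:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, full_every=10, tol=1e-6, max_sweeps=500):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1 with an
    active-set strategy: most sweeps touch only currently-nonzero
    coordinates; every `full_every`-th sweep scans all coordinates so new
    variables can enter. Stops on small relative parameter change."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for sweep in range(max_sweeps):
        old = beta.copy()
        idx = range(p) if sweep % full_every == 0 else np.flatnonzero(beta)
        for j in idx:
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            beta[j] = soft(X[:, j] @ r, lam) / col_sq[j]
        if np.linalg.norm(beta - old) <= tol * (1.0 + np.linalg.norm(old)):
            break
    return beta

def modified_bic(loglik, d_eff, n):
    """Modified BIC: -2*loglik + log(n)*d_eff, with d_eff the number of
    nonzero estimated coefficients (effective parameter count)."""
    return -2.0 * loglik + np.log(n) * d_eff

def select_lambda(X, y, lam_grid):
    """Fit over a lambda grid and pick the modified-BIC minimizer
    (Gaussian log-likelihood with unit variance, for illustration)."""
    best = None
    for lam in lam_grid:
        beta = lasso_cd(X, y, lam)
        resid = y - X @ beta
        ll = -0.5 * resid @ resid - 0.5 * len(y) * np.log(2 * np.pi)
        bic = modified_bic(ll, int(np.count_nonzero(beta)), len(y))
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best
```

The grid would typically run from a data-driven $\lambda_{\max}$ (where all coefficients are zero) down toward zero, as in the tuning strategies above.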
Complex models (e.g., individual covariates in log-linear models) are handled efficiently by regression-style algorithms, whereas constrained (Lagrangian) methods are computationally prohibitive even for modest sample sizes (Evans et al., 2011).
7. Connections and Comparative Perspectives
$\ell_1$-penalised marginal likelihood generalizes the lasso ($\ell_1$-penalised least squares) to settings where the likelihood includes latent variables, marginalization, or partition functions, inducing non-convexity. In mixture models, convexity is lost due to the log-sum-exp structure, distinguishing these applications from convex penalised regression.
Graphical model selection via $\ell_1$-penalised maximum likelihood in Gaussian and Ising models employs convex formulations (0707.0704) but faces challenges in scaling to large $p$; first-order methods and block coordinate descent provide tractable alternatives to interior point methods. For discrete data, marginal log-linear models fit via $\ell_1$-penalised quadratic approximation and coordinate descent yield sparse submodels and scale to large sample size and covariate dimension.
The method’s ability to induce exact zeros, prevent overfitting under limited data, and enable interpretable high-dimensional modeling links it to broader regularization and model selection frameworks in statistical learning.
References:
- Städler, Bühlmann, van de Geer (2010), “$\ell_1$-Penalization for Mixture Regression Models,” (Städler et al., 2012)
- Evans, Forcina (2011), “Two algorithms for fitting constrained marginal models,” (Evans et al., 2011)
- Banerjee, El Ghaoui, d’Aspremont (2008), “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data,” (0707.0704)