Levenberg-Marquardt Optimization

Updated 17 February 2026
  • Levenberg-Marquardt is an iterative method for nonlinear least-squares that dynamically adjusts a damping parameter to balance between Gauss–Newton and gradient descent.
  • It achieves robust local convergence with linear to quadratic rates and is applied in parameter estimation, machine learning, and inverse problems.
  • Enhancements like adaptive damping, singular scaling, and matrix-free implementations improve its stability and performance on ill-conditioned and large-scale problems.

The Levenberg–Marquardt (LM) optimization algorithm is a prominent iterative method for solving nonlinear least-squares problems. It serves as an interpolation between the Gauss–Newton algorithm and gradient descent by dynamically adjusting a damping (regularization) parameter. Across its diverse applications—including parameter estimation, machine learning, inverse problems, and trajectory design—the LM method is characterized by robust local convergence and the ability to control update magnitudes, making it especially effective in scenarios involving ill-conditioned or highly nonlinear models.

1. Mathematical Foundations and Classical Formulation

The standard LM method addresses the minimization of sum-of-squares objectives of the form

\min_{x\in\mathbb{R}^n}~f(x) = \frac{1}{2} \|F(x)\|^2,

where F:\mathbb{R}^n\to\mathbb{R}^m is twice continuously differentiable. The method forms a local quadratic (Gauss–Newton) model at iterate x_k:

m_k(s) = \frac{1}{2} \|F(x_k) + J_k s\|^2 + \frac{1}{2} \lambda_k \|s\|^2,

with Jacobian J_k = J(x_k) = \nabla F(x_k) and a scalar damping parameter \lambda_k > 0. The LM update s_k is obtained by solving

(J_k^T J_k + \lambda_k I)\, s_k = -J_k^T F(x_k),

so that x_{k+1} = x_k + s_k (Bergou et al., 2020; Philipps et al., 2020). As \lambda_k \to 0, LM reduces to the Gauss–Newton step; as \lambda_k \to \infty, the step approaches a short gradient-descent step.
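
For orientation, here is a minimal NumPy sketch of this single step; the names F_val, J, and lam are illustrative inputs (residual vector, dense Jacobian, damping), not taken from the cited papers:

    import numpy as np

    def lm_step(F_val, J, lam):
        """Solve the damped normal equations (J^T J + lam*I) s = -J^T F."""
        n = J.shape[1]
        A = J.T @ J + lam * np.eye(n)    # damped Gauss-Newton matrix
        g = J.T @ F_val                  # gradient of f(x) = 0.5*||F(x)||^2
        return np.linalg.solve(A, -g)    # LM step s_k

For small lam the result approaches the Gauss–Newton step; for large lam it shrinks toward -g/lam, a short gradient-descent step.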

In maximum likelihood estimation for exponential families, LM generalizes to

\theta^{(t+1)} = \theta^{(t)} - \left[H_l(\theta^{(t)}) + \gamma^{(t)} P(\theta^{(t)})\right]^{-1} s(\theta^{(t)}),

where H_l is the observed Hessian of the log-likelihood, s is the score function, and P is a user-chosen negative-definite penalty, often P = \operatorname{diag} H_l (Giordan et al., 2014).
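
A minimal sketch of this penalized update, assuming hypothetical callables score() and hessian() for the log-likelihood and the common choice P = diag(H_l):

    import numpy as np

    def penalized_newton_step(theta, score, hessian, gamma):
        """One update theta - [H_l + gamma*P]^{-1} s(theta) with P = diag(H_l)."""
        H = hessian(theta)               # observed Hessian H_l(theta)
        P = np.diag(np.diag(H))          # diagonal penalty P = diag(H_l)
        return theta - np.linalg.solve(H + gamma * P, score(theta))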

2. Damping Parameter Adaptation and Trust-Region Interpretation

The choice and update of the damping parameter \lambda_k critically affect LM performance. A standard strategy decreases \lambda_k when a step yields sufficient reduction in f(x) (step acceptance) and increases it otherwise:

  • If f(x_k + s_k) < f(x_k), then \lambda_{k+1} = \rho \lambda_k with \rho \in (0,1);
  • otherwise, \lambda_{k+1} = \tau \lambda_k with \tau > 1 (Protic et al., 2021).

The "gain ratio"

\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(0) - m_k(s_k)}

can be used to refine the update, as in \lambda_{k+1} = \lambda_k \max\{1/3,\, 1-(2\rho_k-1)^3\} when \rho_k > 0 (Giordan et al., 2014; Philipps et al., 2020; Nadjiasngar et al., 2011). This dynamic adjustment underpins the trust-region interpretation of LM: the damping constrains the step size to maintain stability and guarantee monotonic decrease of the residual norm.
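
Putting the pieces together, a hedged sketch of the full damping-adaptation loop; F(x) and J(x) are assumed user-supplied callables returning the residual vector and Jacobian, and the constants are illustrative:

    import numpy as np

    def lm_fit(F, J, x0, lam=1e-3, tol=1e-8, max_iter=200):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            r, Jx = F(x), J(x)
            g = Jx.T @ r                       # gradient of f(x) = 0.5*||F(x)||^2
            if np.linalg.norm(g) < tol:
                break                          # gradient-norm stopping test
            s = np.linalg.solve(Jx.T @ Jx + lam * np.eye(x.size), -g)
            r_new = F(x + s)
            actual = 0.5 * (r @ r - r_new @ r_new)
            predicted = -0.5 * (g @ s)         # m_k(0) - m_k(s_k), via the normal equations
            rho = actual / predicted           # gain ratio rho_k
            if rho > 0:                        # accept step and relax damping
                x = x + s
                lam *= max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
            else:                              # reject step and increase damping (tau = 2)
                lam *= 2.0
        return x

The acceptance branch implements the \max\{1/3, 1-(2\rho_k-1)^3\} rule quoted above; the rejection branch is the simple \tau > 1 increase.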

3. Convergence Properties: Local and Global Behavior

The LM method is globally convergent under mild regularity and bounded level-set assumptions. For the unconstrained case, if F is twice differentiable and the Jacobian is Lipschitz continuous and bounded, the sequence \{x_k\} converges to stationary points of f, and an upper bound on the number of iterations needed to reach gradient norm \leq \epsilon is \widetilde{O}(\epsilon^{-2}) (Bergou et al., 2020).

Local convergence is guaranteed to be at least linear, and becomes quadratic in the so-called "zero-residual" regime (true solution with F(x_*) = 0) provided the error-bound condition holds:

\operatorname{dist}(x, X^*) \leq M \|F(x) - F(\bar{x})\|,

where X^* is the set of minimizers and \bar{x} denotes a point of X^*. In the nonzero-residual case, the rate is linear (Bergou et al., 2020). For nonlinear inverse problems and certain regularized settings, quadratic local convergence can be ensured by adaptively choosing the regularization to match the data misfit (Daijun et al., 2015).
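
For orientation, the zero-residual rate can be made explicit. A standard estimate of this type (a general Yamashita–Fukushima-style bound, stated here for context rather than quoted from the cited papers) takes \lambda_k = \|F(x_k)\|^2 and, under the error-bound condition above, yields for iterates near X^*

\operatorname{dist}(x_{k+1}, X^*) \leq C\, \operatorname{dist}(x_k, X^*)^2

for some constant C > 0, which is precisely the quadratic rate claimed above.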

4. Algorithmic Enhancements and Generalizations

Numerous variants of the classical LM algorithm have been developed to address specific difficulties:

  • Singular Scaling: Incorporating possibly rank-deficient regularization matrices L^T L to enforce smoothness or problem structure, notably in ill-posed inverse problems (Boos et al., 2023).
  • Robust Estimation: Employing LM within robust objective functions, such as LOVO (Low Order-Value Optimization), to downweight outliers. Here, at each iteration, the algorithm restricts attention to the residuals of lowest magnitude, updating the active set and recomputing the LM step (Castelani et al., 2019).
  • q-Generalization: Replacing classical derivatives with q-derivatives to enhance global search properties and escape shallow local minima (Protic et al., 2021).
  • Accelerated-Proximal Schemes: For composite minimization (e.g., nonsmooth plus smooth-composite objectives), the "prox-linear" or generalized LM method solves damped, strongly convex subproblems with accelerated gradient sub-solvers, achieving quadratic convergence under quadratic growth conditions and providing sharp bounds on oracle complexity (Marumo et al., 2022).
  • Matrix-Free and Distributed Implementations: Large-scale settings benefit from algorithms leveraging dynamic programming and block-structured decompositions to parallelize the LM step, or from formulations that use only Jacobian-vector products (Haring et al., 2022; Bergou et al., 2020); see the sketch after this list.
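
A minimal matrix-free sketch of the LM step in this spirit: conjugate gradients solve (J^T J + \lambda I) s = -J^T F while touching J only through products. The callables jvp (v -> J v) and vjp (w -> J^T w) are assumed inputs, not an API from the cited papers:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def lm_step_matrix_free(F_val, jvp, vjp, n, lam):
        def matvec(v):
            return vjp(jvp(v)) + lam * v       # (J^T J + lam*I) v, no matrix formed
        A = LinearOperator((n, n), matvec=matvec, dtype=float)
        s, info = cg(A, -vjp(F_val))           # info == 0 signals CG convergence
        return s

Since J^T J + \lambda I is symmetric positive definite for \lambda > 0, CG is a natural sub-solver here.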

5. Applications Across Domains

The LM algorithm finds use across a broad range of fields, including:

  • Statistical estimation: maximum-likelihood fitting in exponential-family and compositional models such as the Dirichlet and Aitchison families (Giordan et al., 2014).
  • Tracking and filtering: Gauss–Newton radar-tracking filters incorporating LM damping (Nadjiasngar et al., 2011).
  • Astrodynamics: trajectory design in the Earth-Moon system (Nunes et al., 21 Oct 2025).
  • Signal and image processing: tensor CP decomposition for image compression (Karim et al., 2024).
  • Machine learning: training of neural networks and physics-informed neural networks (PINNs) (Shahab et al., 9 Feb 2026; Pooladzandi et al., 2022).
  • System identification: sparse identification of dynamical systems (Haring et al., 2022).
  • Inverse problems: parameter identification in elliptic and parabolic PDEs (Daijun et al., 2015; Boos et al., 2023).
  • Hierarchical optimization: penalty-based reformulations in bilevel optimization (Tin et al., 2021).

6. Practical Implementation and Performance Considerations

Implementations of LM must address several key concerns:

  • Jacobian Computation: Analytical expressions, automatic differentiation, finite differences, or quasi-Newton (e.g., Broyden) updates may be used, with structured or block-sparse assembly wherever possible (Transtrum et al., 2012).
  • Stopping Criteria: Convergence is typically declared when parameter increments, the gradient norm, or objective changes fall below prescribed thresholds (e.g., \|\nabla f(x)\| < \epsilon), or when a maximum iteration count is reached (Bergou et al., 2020; Philipps et al., 2020); a short usage sketch with a library implementation follows this list.
  • Globalization: Line-search or trust-region strategies ensure sufficient decrease and safeguard against divergence. Nonmonotone acceptance rules or heuristic step rejections are recommended for constrained or ill-conditioned scenarios (Bergou et al., 2020).
  • Parallelization: Large-scale problems and computational bottlenecks in Jacobian or Hessian computation are alleviated via distributed or blockwise solution techniques (Haring et al., 2022; Philipps et al., 2020).
  • Tuning of Regularization: The choice of initial damping, regularization matrix, or adaptive weighting in the cost function has substantial impact. Model-informed scaling (e.g., \lambda_k = \|F(x_k)\|^2) and specialized adaptive schemes are preferred (Boos et al., 2023).
  • Robustness: The R package marqLevAlg integrates stringent stopping rules, including parameter-, objective-, and Hessian-based tests, to reduce the risk of premature convergence at stationary but nonoptimal points (Philipps et al., 2020).
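
As a usage sketch, SciPy's MINPACK-based LM implementation bundles many of these concerns (Jacobian handling via finite differences, the three stopping tests, internal scaling); the exponential model and data below are made up for illustration:

    import numpy as np
    from scipy.optimize import least_squares

    t = np.linspace(0.0, 4.0, 50)
    y = 2.5 * np.exp(-1.3 * t) + 0.01 * np.random.default_rng(0).normal(size=t.size)

    def residuals(p):
        a, b = p
        return a * np.exp(-b * t) - y          # residual vector F(x)

    # method='lm' dispatches to MINPACK's Levenberg-Marquardt; xtol, ftol, and
    # gtol realize the parameter-, objective-, and gradient-based stopping tests.
    sol = least_squares(residuals, x0=[1.0, 1.0], method='lm',
                        xtol=1e-10, ftol=1e-10, gtol=1e-10)
    print(sol.x)                               # estimated (a, b), close to (2.5, 1.3)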

Table: Summary of Damping Parameter Adaptation in Representative LM Variants

Variant / Paper | Damping Update Rule | Empirical Regime
Classical LM (Bergou et al., 2020) | Increase on step rejection, decrease on acceptance (trust-region test) | Locally quadratic, globally convergent
Exponential Family (Giordan et al., 2014) | Gain-ratio-based: \gamma_{t+1} = \gamma_t \max\{1/3,\, 1-(2\rho-1)^3\} | Fast and stable
Robust LM-LOVO (Castelani et al., 2019) | Increase or decrease proportional to current gradient norm | Handles outliers
Generalized LM (Marumo et al., 2022) | Backtracking: increase until sufficient descent is achieved | Composite objectives
Heat/Inverse (Boos et al., 2023) | \lambda_k = \|F(x_k)\|^2; step size via Armijo or fixed | Smoothness-regularized

7. Comparative Empirical Evidence

Empirical evaluations consistently demonstrate that the LM algorithm, particularly with adaptive damping and curvature-informed scaling, outperforms naive Gauss–Newton or basic first-order methods when the objective is a nonlinear least-squares problem with moderate to strong nonlinearity or ill-conditioning. For example, in large compositional models (Dirichlet, Aitchison), LM converges reliably and rapidly where Newton–Raphson or fixed-point iteration may stall or diverge (Giordan et al., 2014). In neural networks and PINNs, LM attains lower final losses and solution errors than BFGS, SGD, Adam, or L-BFGS, often with fewer iterations and comparable or lower computational cost for moderate problem sizes (Shahab et al., 9 Feb 2026; Pooladzandi et al., 2022). In inverse problems with PDEs, LM with singular scaling solves ill-posed parameter identification problems with higher accuracy and fewer iterations than unregularized alternatives (Boos et al., 2023).

References

  • "On the maximization of likelihoods belonging to the exponential family using ideas related to the Levenberg-Marquardt approach" (Giordan et al., 2014)
  • "The q-Levenberg-Marquardt method for unconstrained nonlinear optimization" (Protic et al., 2021)
  • "A robust method based on LOVO functions for solving least squares problems" (Castelani et al., 2019)
  • "Designing trajectories in the Earth-Moon system: a Levenberg-Marquardt approach" (Nunes et al., 21 Oct 2025)
  • "Convergence and Complexity Analysis of a Levenberg-Marquardt Algorithm for Inverse Problems" (Bergou et al., 2020)
  • "Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization" (Transtrum et al., 2012)
  • "Gauss-Newton Filtering incorporating Levenberg-Marquardt Methods for Radar Tracking" (Nadjiasngar et al., 2011)
  • "Modified Levenberg-Marquardt Algorithm For Tensor CP Decomposition in Image Compression" (Karim et al., 2024)
  • "Levenberg-Marquardt method and partial exact penalty parameter selection in bilevel optimization" (Tin et al., 2021)
  • "Robust and Efficient Optimization Using a Marquardt-Levenberg Algorithm with R Package marqLevAlg" (Philipps et al., 2020)
  • "Levenberg-Marquardt method with Singular Scaling and applications" (Boos et al., 2023)
  • "Quadratic Convergence of Levenberg-Marquardt Method for Elliptic and Parabolic Inverse Robin Problems" (Daijun et al., 2015)
  • "Accelerated-gradient-based generalized Levenberg--Marquardt method with oracle complexity bound and local quadratic convergence" (Marumo et al., 2022)
  • "Do physics-informed neural networks (PINNs) need to be deep? Shallow PINNs using the Levenberg-Marquardt algorithm" (Shahab et al., 9 Feb 2026)
  • "A Nonmonotone Matrix-Free Algorithm for Nonlinear Equality-Constrained Least-Squares Problems" (Bergou et al., 2020)
  • "Improving Levenberg-Marquardt Algorithm for Neural Networks" (Pooladzandi et al., 2022)
  • "A Levenberg-Marquardt algorithm for sparse identification of dynamical systems" (Haring et al., 2022)