MP-LBFGS: Efficient Optimization for FBPINNs

Updated 20 January 2026
  • The algorithm leverages local LBFGS steps within FBPINN subdomains to form quasi-Newton corrections, significantly reducing global synchronizations.
  • It aggregates local corrections through a nonlinear subspace minimization, preserving curvature benefits while ensuring global secant consistency.
  • Empirical benchmarks on PDE problems show reduced training epochs and improved accuracy by trading off local computation for fewer communications.

Multi-Preconditioned LBFGS (MP-LBFGS) is an optimization algorithm developed to accelerate and robustify the training of finite-basis physics-informed neural networks (FBPINNs). Inspired by the nonlinear additive Schwarz method from domain decomposition in numerical PDEs, MP-LBFGS exploits the intrinsic additive structure of FBPINNs by constructing parallel, subdomain-local quasi-Newton corrections and then optimally combining them via a nonlinear, low-dimensional subspace minimization. This approach yields a preconditioned search direction that both improves convergence and reduces communication overhead in distributed learning settings (Salvadó-Benasco et al., 13 Jan 2026).

1. FBPINN Architecture and Domain Decomposition

The motivating problem is the solution of a partial differential equation (PDE) $\mathcal{P}[u](x) = f(x)$ on a bounded domain $\Omega \subset \mathbb{R}^d$ with boundary conditions $u = g$ on $\partial \Omega$. A physics-informed neural network (PINN) approximates the unknown $u$ by a global neural network $N(\theta; x)$, trained to minimize the squared PDE residual over collocation points.

FBPINNs address limitations such as spectral bias by decomposing $\Omega$ into $n_s$ overlapping subdomains $\{\Omega_j\}$ (with overlap width $\delta$). Each $\Omega_j$ receives a local network $N_j(\theta_j; x)$; the global solution is assembled via a smooth partition of unity $\{w_j\}$ satisfying $\sum_j w_j(x) \equiv 1$ and $\mathrm{supp}\, w_j \subset \Omega_j$:

$$N(\theta; x) = \sum_{j=1}^{n_s} w_j(x)\, N_j(\theta_j; \mathrm{norm}_j(x)), \qquad \theta = (\theta_1, \dots, \theta_{n_s}) \in \mathbb{R}^p, \quad p = \sum_j p_j.$$

The normalization $\mathrm{norm}_j$ maps local coordinates to $(-1, 1)^d$. Collocation data and model parameters thus admit a natural subdomain-local partitioning, facilitating parallelism.
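The assembly above can be sketched in a few lines. The following is a hypothetical 1D toy (hand-chosen subdomains, cosine-squared windows, stand-in "local networks"), not the paper's implementation:

```python
import numpy as np

# Toy 1D FBPINN assembly on [0, 1] with two overlapping subdomains.
# The windows are normalized pointwise so that sum_j w_j(x) == 1
# (partition of unity); the "local networks" are stand-in functions.
subdomains = [(0.0, 0.6), (0.4, 1.0)]          # overlap width delta = 0.2

def raw_window(x, a, b):
    """Smooth bump supported on [a, b] (cosine-squared taper)."""
    t = np.clip((x - a) / (b - a), 0.0, 1.0)
    return np.sin(np.pi * t) ** 2

def norm_j(x, a, b):
    """Map subdomain coordinates onto (-1, 1), as in the FBPINN definition."""
    return 2.0 * (x - a) / (b - a) - 1.0

def assemble(x, local_nets):
    """Global prediction N(theta; x) = sum_j w_j(x) N_j(theta_j; norm_j(x))."""
    raw = np.stack([raw_window(x, a, b) for a, b in subdomains])
    w = raw / raw.sum(axis=0)                  # normalize: partition of unity
    u = sum(w[j] * local_nets[j](norm_j(x, *subdomains[j]))
            for j in range(len(subdomains)))
    return u, w

nets = [np.sin, np.cos]                        # stand-ins for N_1, N_2
x = np.linspace(0.05, 0.95, 101)               # avoid the unwindowed endpoints
u, w = assemble(x, nets)
```

The normalization step makes the windows a partition of unity by construction, so the global prediction interpolates smoothly between the local models in the overlap.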

2. Nonlinear Additive Schwarz Motivation

In classical PDE solvers, nonlinear additive Schwarz preconditioning accelerates Newton/quasi-Newton methods by performing local solves on each subdomain and then aggregating the results. MP-LBFGS transposes this methodology: each subnetwork solves a local training subproblem using several LBFGS steps, generating a local correction. Aggregation is enforced by a right-preconditioned LBFGS step that ensures global secant consistency.

The full-space optimality condition $\nabla L(\theta) = 0$ is reformulated as

$$\mathcal{F}(\theta) := \nabla L(P(\theta)) = 0,$$

with the nonlinear additive Schwarz map

$$P(\theta) = \theta + \sum_{j=1}^{n_s} R_j^\top (\theta_j^* - R_j \theta),$$

where $R_j$ restricts the global parameters to subdomain $j$, and $\theta_j^*$ approximates the local minimizer via local LBFGS updates. This induces a lifted, preconditioned variable on which global LBFGS is performed, retaining curvature benefits while reducing communication frequency.
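The map $P$ can be made concrete on a toy quadratic loss. Here an exact block solve stands in for the approximate local minimizer $\theta_j^*$, and index arrays play the role of the restrictions $R_j$ (everything below is a synthetic sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
A = np.diag(np.arange(1.0, p + 1))          # SPD Hessian of the toy loss
b = rng.normal(size=p)                      # L(theta) = 0.5 th'A th - b'th

def schwarz_map(theta, blocks):
    """P(theta) = theta + sum_j R_j^T (theta_j* - R_j theta)."""
    out = theta.copy()
    for idx in blocks:
        comp = np.setdiff1d(np.arange(p), idx)       # frozen coordinates
        # Exact local solve over theta[idx], other coordinates held fixed,
        # standing in for a few local LBFGS steps.
        rhs = b[idx] - A[np.ix_(idx, comp)] @ theta[comp]
        theta_star = np.linalg.solve(A[np.ix_(idx, idx)], rhs)
        out[idx] += theta_star - theta[idx]          # overlap corrections add
    return out

blocks = [np.array([0, 1, 2, 3]), np.array([2, 3, 4, 5])]   # overlapping
theta0 = rng.normal(size=p)
theta1 = schwarz_map(theta0, blocks)
```

Note the additive character of the map: in the overlap (indices 2 and 3) both local corrections contribute, exactly as in the sum over $j$ above.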

3. Algorithmic Components of MP-LBFGS

3.1 Local LBFGS Corrections

At each global iteration $k$, the global parameters $\theta^{(k)}$ are distributed to the subdomains, yielding $\theta_j^{(k)} = R_j \theta^{(k)}$. Each local network minimizes its loss $L_j$ by performing $\eta$ LBFGS steps, generating the secant pairs

$$s_j^{(k,q)} = \theta_j^{(k,q+1)} - \theta_j^{(k,q)}, \qquad y_j^{(k,q)} = \nabla L_j(\theta_j^{(k,q+1)}) - \nabla L_j(\theta_j^{(k,q)}), \qquad q = 0, \dots, \eta - 1,$$

resulting in an approximate minimizer $\theta_j^{(k,*)} = \theta_j^{(k,\eta)}$ and the local correction $c_j^{(k)} = \theta_j^{(k,*)} - \theta_j^{(k)}$.
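A single local solve can be sketched with an off-the-shelf LBFGS. Below, SciPy's L-BFGS-B (capped at $\eta$ iterations) stands in for the local solver and the Rosenbrock function stands in for the local loss $L_j$, so all specifics are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def L_j(th):
    """Stand-in local loss (Rosenbrock), not an actual FBPINN residual."""
    return (1 - th[0]) ** 2 + 100 * (th[1] - th[0] ** 2) ** 2

def grad_L_j(th):
    return np.array([
        -2 * (1 - th[0]) - 400 * th[0] * (th[1] - th[0] ** 2),
        200 * (th[1] - th[0] ** 2),
    ])

eta = 5                                    # number of local LBFGS steps
theta_jk = np.array([-1.2, 1.0])           # R_j theta^{(k)}, chosen arbitrarily
res = minimize(L_j, theta_jk, jac=grad_L_j,
               method="L-BFGS-B", options={"maxiter": eta})
theta_j_star = res.x                       # approximate minimizer theta_j^{(k,*)}
c_jk = theta_j_star - theta_jk             # local correction c_j^{(k)}
```

In the actual algorithm, each subdomain runs such a solve independently and in parallel; only the resulting corrections are communicated.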

3.2 Subspace Aggregation

Local corrections are collected as

$$C^{(k)} = \bigl[\, R_1^\top c_1^{(k)}, \; \dots, \; R_{n_s}^\top c_{n_s}^{(k)} \,\bigr] \in \mathbb{R}^{p \times n_s}.$$

Coefficients $\beta = (\beta_1, \dots, \beta_{n_s})$ are sought such that $d^{(k)} = C^{(k)} \beta$ minimizes the overall objective. While linear preconditioning would suggest

$$\min_{\beta} \biggl\| \sum_{j=1}^{n_s} \beta_j \bigl(H_j^{(k)}\bigr)^{-1} g^{(k)} - g^{(k)} \biggr\|^2, \qquad \sum_j \beta_j = 1,$$

robustness in the neural setting is enhanced by instead solving the nonlinear subspace minimization

$$\min_{\beta \in \mathbb{R}^{n_s}} \phi(\beta) := L\bigl(\theta^{(k)} + C^{(k)} \beta\bigr).$$

This problem is solved efficiently with a few (damped) Newton steps; the Hessian $(C^{(k)})^\top \nabla^2 L(\theta^{(k)})\, C^{(k)}$ is only $n_s \times n_s$ and thus cheap to form and factor.
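The subspace step is easy to illustrate on a quadratic stand-in loss, where $\phi(\beta)$ is itself quadratic and a single (undamped) Newton step minimizes it exactly. All values below are synthetic; $C$'s columns merely stand in for the lifted local corrections $R_j^\top c_j$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_s = 8, 3
A = np.diag(np.arange(1.0, p + 1))             # stand-in for nabla^2 L
b = rng.normal(size=p)
L = lambda th: 0.5 * th @ A @ th - b @ th      # toy quadratic loss
grad = lambda th: A @ th - b

theta_k = rng.normal(size=p)
C = rng.normal(size=(p, n_s))                  # correction matrix C^{(k)}

g_beta = C.T @ grad(theta_k)                   # gradient of phi at beta = 0
H_beta = C.T @ A @ C                           # n_s x n_s Hessian: cheap
beta_star = -np.linalg.solve(H_beta, g_beta)   # one Newton step on phi
theta_tilde = theta_k + C @ beta_star          # updated parameters
```

The key point is the dimensionality: regardless of $p$, the Newton solve acts on an $n_s \times n_s$ system, so its cost is negligible next to a gradient evaluation.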

3.3 Global LBFGS Step

The updated parameters are $\tilde{\theta}^{(k)} = \theta^{(k)} + C^{(k)} \beta^*$, with new gradient $\tilde{g} = \nabla L(\tilde{\theta}^{(k)})$. The global secant pair $(s^{(k)}, y^{(k)})$ is then formed:

$$s^{(k)} = \tilde{\theta}^{(k)} - \theta^{(k)}, \qquad y^{(k)} = \tilde{g} - g^{(k)}.$$

The global LBFGS memory is updated, the descent direction $d_{\text{LBFGS}}^{(k)} = -\bigl(B^{(k)}\bigr)^{-1} \tilde{g}$ is computed with the standard two-loop recursion, and a Wolfe-condition line search yields

$$\theta^{(k+1)} = \tilde{\theta}^{(k)} + \alpha^{(k)} d_{\text{LBFGS}}^{(k)}.$$
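The global step's two ingredients, the two-loop recursion and the secant-pair bookkeeping, can be sketched as follows. This is a generic LBFGS skeleton on a toy quadratic, with a simple Armijo backtracking search standing in for the Wolfe search described above:

```python
import numpy as np

def two_loop(g, memory):
    """Standard LBFGS two-loop recursion: returns d = -B^{-1} g."""
    q = g.copy()
    alphas = []
    for s, y in reversed(memory):              # newest pair first
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if memory:
        s, y = memory[-1]
        q *= (s @ y) / (y @ y)                 # initial Hessian scaling
    for (s, y), a in zip(memory, reversed(alphas)):   # oldest pair first
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q

rng = np.random.default_rng(2)
p, Q = 10, 5                                   # dimension, memory size
A = np.diag(np.arange(1.0, p + 1))             # toy SPD Hessian
bvec = rng.normal(size=p)
L = lambda th: 0.5 * th @ A @ th - bvec @ th
grad = lambda th: A @ th - bvec

theta, memory = rng.normal(size=p), []
for _ in range(30):
    g = grad(theta)
    d = two_loop(g, memory)
    alpha = 1.0                                # Armijo backtracking stands in
    while L(theta + alpha * d) > L(theta) + 1e-4 * alpha * (g @ d):
        alpha *= 0.5                           # for the Wolfe line search
    theta_new = theta + alpha * d
    s, y = theta_new - theta, grad(theta_new) - g
    if s @ y > 1e-12:                          # curvature check before storing
        memory = (memory + [(s, y)])[-Q:]      # keep the Q most recent pairs
    theta = theta_new
```

In MP-LBFGS the pair $(s^{(k)}, y^{(k)})$ fed into this memory is the preconditioned one built from $\tilde{\theta}^{(k)}$, which is what enforces global secant consistency.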

4. Iterative Structure and Communication Patterns

The MP-LBFGS outer iteration proceeds as follows:

| Step | Operation | Communication pattern |
|------|-----------|-----------------------|
| 1 | Compute $\nabla L(\theta^{(k)})$ | Global all-reduce |
| 2 | Parallel local LBFGS | None; fully local |
| 3 | Aggregate local corrections | Minimal (local to global) |
| 4 | Subspace minimization | Fully local or negligible |
| 5–10 | Global LBFGS / line search | Standard (as in vanilla LBFGS) |

Relative to traditional LBFGS, MP-LBFGS reduces the number of global synchronizations by a factor of roughly $\eta$, replacing some forward/backward passes over the full network with local computations.

5. Computational Complexity per Outer Iteration

Let $p$ be the total parameter count, $p_j$ the count per subdomain, $Q$ the global LBFGS memory size, and $\eta$ the number of local steps:

  • Forward/backward passes: one global gradient evaluation plus $\#\text{ls}$ extra loss evaluations (one per line-search trial); $\eta$ local gradient evaluations per subdomain, computed in parallel. The subspace minimization adds a few further parallel evaluations.
  • LBFGS recursion: $O(pQ)$ flops and $O(pQ)$ memory for the global step; $O(p_j Q_j)$ per local memory.
  • Communication: one all-reduce per global gradient and per line-search trial; MP-LBFGS thus reduces the number of synchronizations by a factor of roughly $\eta$ compared to standard LBFGS.

This configuration enables a tradeoff between local computational work and synchronization frequency.
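A back-of-envelope count makes the tradeoff concrete. The numbers below are hypothetical, not taken from the paper:

```python
# If standard LBFGS needs T iterations with one all-reduce each, and
# MP-LBFGS amortizes eta local steps per outer iteration so that roughly
# T / eta outer iterations suffice, the all-reduce count drops by ~eta.
T, eta = 1000, 10                  # hypothetical iteration counts
sync_standard = T                  # one all-reduce per LBFGS iteration
sync_mp = T // eta                 # one per MP-LBFGS outer iteration
speedup = sync_standard / sync_mp  # ~eta fewer synchronizations
```

Whether the extra local gradient work pays off depends on the ratio of communication latency to per-subdomain compute cost on the target hardware.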

6. Empirical Performance and Benchmarks

Numerical experiments on 1D and 2D Poisson equations and the 2D time-dependent Burgers' equation used FBPINN configurations with four-layer ResNets (width 20), 2,000–20,000 Hammersley-sampled collocation points, and various uniform decompositions ($2 \times 2$, $4 \times 2$, $3 \times 3$).

Scaling strategies compared:

  • Uniform scaling (UniS)
  • Line-search scaling (LSS)
  • Subspace minimization (SPM)

Key empirical findings:

  • SPM was the most robust and stable strategy as $n_s$ increased.
  • MP-LBFGS with SPM reduced the required outer epochs (global synchronizations) by up to an order of magnitude compared with standard LBFGS.
  • Total per-device gradient work was comparable or lower; the final validation error in the $L^2$ norm was often an order of magnitude smaller.
  • Increasing the number of local LBFGS steps $\eta$ further reduced synchronizations, at the cost of increased local computation.

7. Practical Recommendations and Insights

The nonlinear Schwarz-type preconditioner enabled by the FBPINN decomposition allows extensive local computation, substantially reducing communication overhead in distributed settings. Subspace minimization is critical for stability: uniform or sequential line searches fail to scale as the number of subdomains increases.

Implementation recommendations:

  • Tune $\eta$ (local steps), $Q$ (memory size), and $n_s$ (number of subdomains) to balance local computation against synchronization frequency.
  • Retain separate local and global LBFGS memories to avoid mixing curvature information.
  • Precompute and cache the small Hessian blocks $C^\top \nabla^2 L\, C$ to expedite the subspace Newton solves.
  • MP-LBFGS can be integrated into existing FBPINN codebases by wrapping local training in an outer loop, collecting the corrections, solving the $n_s$-dimensional subspace problem, and then applying a standard global LBFGS update.
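One possible shape for such a wrapper is sketched below, with every routine reduced to a trivial stub so the control flow runs. The names and stub behavior are hypothetical, not the paper's code:

```python
import numpy as np

def local_lbfgs(theta_j, eta):
    """Stub for eta local LBFGS steps; here a damped step on 0.5||theta||^2."""
    return 0.9 * theta_j

def mp_lbfgs_outer(theta, blocks, eta=5, outer_iters=3):
    for _ in range(outer_iters):
        # restrict to subdomains, run local solves, lift the corrections
        C = np.zeros((theta.size, len(blocks)))
        for j, idx in enumerate(blocks):
            C[idx, j] = local_lbfgs(theta[idx], eta) - theta[idx]
        # subspace step (stub: uniform coefficients instead of the SPM solve)
        beta = np.ones(len(blocks))
        theta = theta + C @ beta
        # a global LBFGS step on theta would follow here (omitted)
    return theta

theta_out = mp_lbfgs_outer(np.ones(4), [np.array([0, 1]), np.array([2, 3])])
```

The outer loop touches the existing local training code only through `local_lbfgs`, which is what makes the wrapping approach minimally invasive.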

In summary, MP-LBFGS leverages the additive decomposition of FBPINNs to perform parallel, local quasi-Newton updates, followed by a global preconditioned update whose direction is defined by a small, nonlinear subspace optimization. This mechanism accelerates convergence, lowers wall-clock time in distributed environments, and can improve final model accuracy relative to standard LBFGS (Salvadó-Benasco et al., 13 Jan 2026).
