
Primal-Dual Variational Inference

Updated 10 February 2026
  • PD-VI is a scalable variational inference method that reformulates mean-field inference as a constrained optimization using an augmented Lagrangian framework.
  • It employs mini-batch primal-dual updates to jointly optimize local and global parameters, ensuring robust convergence even in high-dimensional, non-conjugate settings.
  • Empirical evaluations demonstrate that PD-VI, especially its block-preconditioned variant, outperforms traditional methods by achieving faster convergence and improved performance on both synthetic and real-world data.

Primal–Dual Variational Inference (PD-VI) is a methodology for scalable mean-field variational inference (MFVI) that reformulates the inference problem as a constrained optimization suitable for mini-batch primal-dual algorithms. By introducing an augmented Lagrangian framework and leveraging both primal and dual variational parameter updates, PD-VI jointly optimizes local and global parameters in the variational family in a scalable and theoretically well-founded manner. The method includes a block-preconditioned extension (P²D-VI) to accommodate parameter heterogeneity and non-isotropic curvature in large-scale latent variable models, providing both improved robustness and optimization efficiency (Lyu et al., 7 Feb 2026).

1. Problem Formulation

Mean-field variational inference seeks to approximate an intractable posterior $p(z,\beta\mid x)$ by a factorized variational family,

$$q_{\phi,\lambda}(z,\beta)=\prod_{i=1}^n q_{\phi_i}(z_i)\cdot q_\lambda(\beta),$$

with the objective of minimizing the Kullback–Leibler divergence,

$$\mathrm{KL}(q\|p) = \mathbb{E}_{q}\left[\log q(z,\beta)-\log p(z,\beta,x)\right],$$

which is equivalent to minimizing the negative evidence lower bound (ELBO). Normalizing by $n$, the objective is

$$\min_{\phi,\lambda} f(\phi,\lambda), \qquad f(\phi,\lambda)=\frac{1}{n}\sum_{i=1}^n f_i(\phi_i,\lambda),$$

where each local term $f_i(\phi_i,\lambda)$ expresses the evidence contribution of data point $x_i$. To enable efficient mini-batch optimization, PD-VI introduces local copies $\lambda_i$ of the global parameter $\lambda$ and enforces consensus constraints $\lambda_i = \lambda_0$ for all $i$. The resulting finite-sum problem is:

$$\min_{\phi_{1:n},\,\lambda_{0:n}} \frac{1}{n}\sum_{i=1}^n f_i(\phi_i,\lambda_i) \quad \text{subject to}\quad \lambda_i = \lambda_0.$$
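As a quick sanity check (not from the paper), the consensus reformulation leaves the objective value unchanged whenever the constraints hold. The snippet below uses hypothetical quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$ in place of the per-datum negative-ELBO terms, with the local parameters $\phi_i$ omitted:

```python
import numpy as np

# Hypothetical quadratic surrogates f_i(lam) = 0.5 * ||lam - a_i||^2
# stand in for the per-datum negative-ELBO terms (phi_i omitted).
rng = np.random.default_rng(0)
n, d = 5, 3
a = rng.normal(size=(n, d))

def f(i, lam):
    return 0.5 * np.sum((lam - a[i]) ** 2)

lam0 = rng.normal(size=d)
f_orig = np.mean([f(i, lam0) for i in range(n)])          # original finite sum
lam_local = np.tile(lam0, (n, 1))                         # local copies lam_i
f_cons = np.mean([f(i, lam_local[i]) for i in range(n)])  # consensus form
assert np.isclose(f_orig, f_cons)  # identical when lam_i = lam_0
```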

2. Augmented Lagrangian and Primal–Dual Structure

Consensus constraints are incorporated via Lagrange multipliers $\mu_i$ and a quadratic penalty parameter $\eta>0$, yielding the (scaled) augmented Lagrangian,

$$\mathcal{L}(\phi_{1:n}, \lambda_{0:n}, \mu_{1:n}) = \frac{1}{n}\sum_{i=1}^n \left\{ f_i(\phi_i, \lambda_i) + \langle \mu_i, \lambda_i - \lambda_0 \rangle + \frac{1}{2\eta}\|\lambda_i - \lambda_0\|^2 \right\}.$$
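The augmented Lagrangian is straightforward to evaluate. A minimal sketch, again using hypothetical quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$ (at consensus, both penalty terms vanish and the plain objective is recovered, for any multipliers):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 4, 2, 0.5
a = rng.normal(size=(n, d))  # defines the toy f_i(lam) = 0.5||lam - a_i||^2

def aug_lagrangian(lam_local, lam0, mu):
    """Scaled augmented Lagrangian, averaged over the n local terms."""
    total = 0.0
    for i in range(n):
        fi = 0.5 * np.sum((lam_local[i] - a[i]) ** 2)
        lin = mu[i] @ (lam_local[i] - lam0)                      # <mu_i, lam_i - lam_0>
        quad = np.sum((lam_local[i] - lam0) ** 2) / (2.0 * eta)  # quadratic penalty
        total += fi + lin + quad
    return total / n

lam0 = np.zeros(d)
# At consensus (lam_i = lam_0) the penalty terms vanish for any mu, so the
# augmented Lagrangian reduces to the plain finite-sum objective.
L_cons = aug_lagrangian(np.tile(lam0, (n, 1)), lam0, rng.normal(size=(n, d)))
f_plain = np.mean([0.5 * np.sum((lam0 - a[i]) ** 2) for i in range(n)])
assert np.isclose(L_cons, f_plain)
```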

The optimization seeks a saddle point,

$$\min_{\phi,\lambda_i}\max_{\mu} \mathcal{L}(\phi, \lambda_i, \mu),$$

and lends itself to alternating primal–dual updates in which only a subset of the local parameters is updated per iteration, making the method well suited to large-scale data regimes.

3. Primal–Dual VI and Block-Preconditioned VI Algorithms

At each iteration, a subset $S_t$ of data indices is sampled for batch updates. For each $i\in S_t$ (in parallel), the following local subproblem is solved (Oracle I):

$$(\phi_i^t, \lambda_i^t) = \arg\min_{\phi_i, \lambda_i} f_i(\phi_i, \lambda_i) + \langle \mu_i^{t-1}, \lambda_i - \lambda_0^{t-1} \rangle + \frac{1}{2\eta}\|\lambda_i - \lambda_0^{t-1}\|^2,$$

$$\mu_i^t = \mu_i^{t-1} + \frac{1}{\eta} (\lambda_i^t - \lambda_0^{t-1}).$$

Auxiliary variables accumulate parameter increments, and the global parameter is updated via:

$$h^t = h^{t-1} + \frac{1}{n}\sum_{i\in S_t}\left( \lambda_i^t - \lambda_0^{t-1} \right),\qquad \lambda_0^t = \frac{1}{|S_t|} \sum_{i\in S_t} \lambda_i^t + h^t.$$

For heterogeneous parameter blocks, PD-VI is extended to P²D-VI. The parameter $\lambda$ is partitioned into $B$ blocks with block-specific penalties $\eta_j$ and block-wise preconditioner $D_\eta = \mathrm{diag}(\eta_1^{-1} I_{d_1},\dots,\eta_B^{-1} I_{d_B})$. The penalty term becomes $\frac{1}{2}\|\lambda_i - \lambda_0\|_{D_\eta}^2$ and the dual update $\mu_i \leftarrow \mu_i + D_\eta(\lambda_i - \lambda_0)$. Oracle II is:

$$(\phi_i^t,\lambda_i^t) = \arg\min_{\phi_i,\lambda_i} f_i(\phi_i,\lambda_i) + \langle\mu_i^{t-1}, \lambda_i-\lambda_0^{t-1}\rangle + \frac{1}{2}\|\lambda_i-\lambda_0^{t-1}\|_{D_\eta}^2,$$

$$\mu_i^t = \mu_i^{t-1} + D_\eta (\lambda_i^t - \lambda_0^{t-1}).$$

Block-specific penalties $\eta_j$ are chosen as $\eta_j \propto 1/L_{\lambda, j}$, where $L_{\lambda, j}$ denotes the Lipschitz constant associated with block $j$.
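A small illustration of this rule, with assumed (made-up) blockwise Lipschitz constants:

```python
import numpy as np

# Assumed blockwise Lipschitz constants L_{lambda,j} (illustrative values).
L_blocks = np.array([0.5, 4.0, 16.0])
c = 1.0                           # proportionality constant (a tuning choice)
eta_blocks = c / L_blocks         # eta_j proportional to 1 / L_{lambda,j}
d_eta_weights = 1.0 / eta_blocks  # diagonal weights of D_eta, one per block
# Stiffer blocks (larger L_j) receive smaller eta_j, i.e. a heavier penalty
# pulling their local copies more tightly toward lam_0.
assert np.all(np.diff(eta_blocks) < 0)
```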

Mini-Batch Primal–Dual VI (PD-VI) Pseudocode

Initialize λ_0, η, T
λ_i ← λ_0, μ_i ← 0, h ← 0 for all i
for t in 1..T:
    Sample mini-batch S_t of size m
    for i in S_t (in parallel):
        (φ_i^t, λ_i^t) = argmin_{φ_i,λ_i} f_i(φ_i,λ_i) + <μ_i^{t-1}, λ_i - λ_0^{t-1}> + 1/(2η) ||λ_i - λ_0^{t-1}||^2
        μ_i^t = μ_i^{t-1} + (1/η)(λ_i^t - λ_0^{t-1})
    for i not in S_t:
        φ_i^t = φ_i^{t-1}, λ_i^t = λ_i^{t-1}, μ_i^t = μ_i^{t-1}
    h^t = h^{t-1} + (1/n) Σ_{i∈S_t} (λ_i^t - λ_0^{t-1})
    λ_0^t = (1/m) Σ_{i∈S_t} λ_i^t + h^t
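The loop above can be instantiated on toy quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$, for which Oracle I has a closed-form minimizer. This is a hedged sketch under that toy setup, not the paper's implementation ($\phi_i$ is omitted and the closed form is specific to the quadratic choice):

```python
import numpy as np

# Toy PD-VI run with quadratic surrogates f_i(lam) = 0.5 * ||lam - a_i||^2.
# For this f_i, Oracle I has the closed form
#   lam_i = (eta * (a_i - mu_i) + lam_0) / (eta + 1).
rng = np.random.default_rng(42)
n, d, eta, T = 8, 3, 1.0, 200
a = rng.normal(size=(n, d))

lam0 = np.zeros(d)
lam = np.tile(lam0, (n, 1))
mu = np.zeros((n, d))
h = np.zeros(d)

for t in range(T):
    S = np.arange(n)          # full batch here; a true mini-batch would subsample
    lam0_old = lam0.copy()
    for i in S:
        # Oracle I in closed form: minimizer of the local subproblem
        lam[i] = (eta * (a[i] - mu[i]) + lam0_old) / (eta + 1.0)
        mu[i] += (lam[i] - lam0_old) / eta        # dual ascent step
    h += (lam[S] - lam0_old).sum(axis=0) / n      # increment accumulator
    lam0 = lam[S].mean(axis=0) + h                # global update

# For these quadratics the consensus optimum is the mean of the a_i,
# and all local copies agree with lam_0 at convergence.
assert np.allclose(lam0, a.mean(axis=0), atol=1e-6)
assert np.allclose(lam, lam0, atol=1e-6)
```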

Block-Preconditioned Oracle (P²D-VI) Pseudocode

Input: f_i, λ_0, μ_i, block-diagonal D_η
(φ_i^t, λ_i^t) = argmin_{φ_i,λ_i} f_i(φ_i,λ_i) + <μ_i, λ_i - λ_0> + 0.5 ||λ_i - λ_0||_{D_η}^2
μ_i^t = μ_i + D_η (λ_i^t - λ_0)
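For the same toy quadratic $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$, Oracle II also has an elementwise closed form once $D_\eta$ is stored as a vector of per-coordinate weights (block sizes and penalty values below are illustrative assumptions):

```python
import numpy as np

# Oracle II on a toy quadratic f_i(lam) = 0.5 * ||lam - a_i||^2.
# Storing D_eta as a per-coordinate weight vector d_eta, the local
# minimizer is elementwise: lam_i = (a_i - mu_i + d_eta*lam_0) / (1 + d_eta).
rng = np.random.default_rng(7)
d1, d2 = 2, 3                                      # two hypothetical blocks
d_eta = np.concatenate([np.full(d1, 1.0 / 0.5),    # eta_1 = 0.5
                        np.full(d2, 1.0 / 2.0)])   # eta_2 = 2.0
a_i = rng.normal(size=d1 + d2)
lam0 = np.zeros(d1 + d2)
mu_i = np.zeros(d1 + d2)

lam_i = (a_i - mu_i + d_eta * lam0) / (1.0 + d_eta)  # Oracle II, closed form
mu_new = mu_i + d_eta * (lam_i - lam0)               # preconditioned dual step

# Stationarity check: the gradient of the local objective vanishes at lam_i.
grad = (lam_i - a_i) + mu_i + d_eta * (lam_i - lam0)
assert np.allclose(grad, 0.0, atol=1e-12)
```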

Both algorithms employ constant step sizes, enable parallel parameter updates, and adapt to geometric differences in parameter blocks.

4. Convergence Properties

PD-VI and P²D-VI achieve provable convergence rates under mild smoothness assumptions, without relying on conjugacy or explicit variance control. For nonconvex $f_i$ that are strongly convex in $\phi_i$ and step size $\eta \leq O(\omega/(L(1+\kappa)))$ (with $\omega = m/n$ and $\kappa = L^2/\mu^2$), PD-VI yields

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\big\| \nabla_\lambda f(\phi^{t-1}, \lambda_0^{t-1}) \big\|^2 = O(1/T).$$

For convex $f_i$, the expected objective gap of the averaged iterates is $O(1/T)$, and for strongly convex $f_i$, the weighted iterate gap contracts at rate $O(1/r^T)$ with $r=1+\mu\eta$. For P²D-VI, letting $S = \sum_j \eta_j^2 L_j^2$ and choosing $\{\eta_j\}$ such that $S \leq C$ for a constant $C$, the same $O(1/T)$ guarantee holds in the block-diagonal norm

$$\|v\|^2_H = \sum_j \eta_j \|v_j\|^2,$$

enabling blockwise weighted descent (Lyu et al., 7 Feb 2026).

5. Empirical Evaluation

Empirical validation encompasses large-scale synthetic and real-world datasets. On synthetic Gaussian mixtures (100,000 points, 10 dimensions, 5 clusters) with biased mini-batches, both PD-VI and P²D-VI demonstrate faster convergence and reach lower Wasserstein distance to the true mixture than SVI, SGD, Adam, and CV-based methods. For spatial transcriptomics (MOSTA dataset: ≈150,000 spatial spots, 20,000 genes) using a non-conjugate Potts-augmented mixture model, PD-VI and especially P²D-VI achieve lower ELBO values, smaller gradient norms, and higher adjusted Rand index (ARI) in fewer iterations compared to SVI, RMSProp, and Adam. Domain clustering maps produced by these methods exhibit sharper and more anatomically coherent results, and the block-preconditioned extension confers further performance improvements.

6. Methodological Significance and Context

PD-VI and its block-preconditioned variant allow for scalable and robust variational inference in non-conjugate and high-dimensional settings. By employing primal-dual strategies native to constrained finite-sum optimization, the methodology simultaneously updates local and global variational parameters within a mini-batch framework without diminishing theoretical guarantees. The incorporation of block-adaptive penalties via P²D-VI offers improved adaptation to curvature heterogeneity, overcoming limitations of isotropic penalty approaches typical of classical SVI updates. This methodology is particularly advantageous where parameter block geometry varies or loss landscapes are highly anisotropic (Lyu et al., 7 Feb 2026).

A central distinction of PD-VI lies in its formulation as an augmented Lagrangian saddle-point problem, which contrasts with classical mean-field variational inference strategies (e.g., coordinate ascent VI, standard SVI) that rely on explicit ELBO maximization and often require either closed-form updates or careful tuning for stochastic optimization. PD-VI's independence from conjugacy and bounded-gradient-variance assumptions substantially broadens its applicability. The use of primal-dual updates and block-diagonal preconditioning situates the method in the broader context of modern stochastic optimization with consensus constraints and blockwise adaptive learning rates. This suggests that PD-VI and P²D-VI may serve as templates for scalable posterior approximation in latent variable models across diverse scientific domains.


Reference: "Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization" (Lyu et al., 7 Feb 2026).
