
Primal-Dual Variational Inference

Updated 10 February 2026
  • PD-VI is a scalable variational inference method that reformulates mean-field inference as a constrained optimization using an augmented Lagrangian framework.
  • It employs mini-batch primal-dual updates to jointly optimize local and global parameters, ensuring robust convergence even in high-dimensional, non-conjugate settings.
  • Empirical evaluations demonstrate that PD-VI, especially its block-preconditioned variant, outperforms traditional methods by achieving faster convergence and improved performance on both synthetic and real-world data.

Primal–Dual Variational Inference (PD-VI) is a methodology for scalable mean-field variational inference (MFVI) that reformulates the inference problem as a constrained optimization suitable for mini-batch primal-dual algorithms. By introducing an augmented Lagrangian framework and leveraging both primal and dual variational parameter updates, PD-VI jointly optimizes local and global parameters in the variational family in a scalable and theoretically well-founded manner. The method includes a block-preconditioned extension (P²D-VI) to accommodate parameter heterogeneity and non-isotropic curvature in large-scale latent variable models, providing both improved robustness and optimization efficiency (Lyu et al., 7 Feb 2026).

1. Problem Formulation

Mean-field variational inference seeks to approximate an intractable posterior $p(z,\beta\mid x)$ by a factorized variational family,

$$q_{\phi,\lambda}(z,\beta)=\prod_{i=1}^n q_{\phi_i}(z_i)\cdot q_\lambda(\beta),$$

with the objective of minimizing the Kullback–Leibler divergence,

$$\mathrm{KL}(q\|p) = \mathbb{E}_{q}\left[\log q(z,\beta)-\log p(z,\beta,x)\right],$$

which is equivalent to minimizing the negative evidence lower bound (ELBO). Normalizing by $n$, the objective is

$$\min_{\phi,\lambda} f(\phi,\lambda), \qquad f(\phi,\lambda)=\frac{1}{n}\sum_{i=1}^n f_i(\phi_i,\lambda),$$

where each local term $f_i(\phi_i,\lambda)$ expresses the evidence contribution of data point $x_i$. To enable efficient mini-batch optimization, PD-VI introduces local copies $\lambda_i$ of the global parameter $\lambda$ and enforces consensus constraints $\lambda_i = \lambda_0$ for all $i$. The resulting finite-sum problem is:

$$\min_{\phi_{1:n},\,\lambda_{0:n}} \frac{1}{n}\sum_{i=1}^n f_i(\phi_i,\lambda_i) \quad \text{subject to}\quad \lambda_i = \lambda_0.$$
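As a quick sanity check (not from the paper), the consensus reformulation leaves the objective value unchanged whenever the constraints hold. The snippet below uses hypothetical quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$ in place of the per-datum negative-ELBO terms, with the local parameters $\phi_i$ omitted:

```python
import numpy as np

# Hypothetical quadratic surrogates f_i(lam) = 0.5 * ||lam - a_i||^2
# stand in for the per-datum negative-ELBO terms (phi_i omitted).
rng = np.random.default_rng(0)
n, d = 5, 3
a = rng.normal(size=(n, d))

def f(i, lam):
    return 0.5 * np.sum((lam - a[i]) ** 2)

lam0 = rng.normal(size=d)
f_orig = np.mean([f(i, lam0) for i in range(n)])          # original finite sum
lam_local = np.tile(lam0, (n, 1))                         # local copies lam_i
f_cons = np.mean([f(i, lam_local[i]) for i in range(n)])  # consensus form
assert np.isclose(f_orig, f_cons)  # identical when lam_i = lam_0
```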

2. Augmented Lagrangian and Primal–Dual Structure

Consensus constraints are incorporated via Lagrange multipliers $\mu_i$ and a quadratic penalty parameter $\eta>0$, yielding the (scaled) augmented Lagrangian,

$$\mathcal{L}(\phi_{1:n}, \lambda_{0:n}, \mu_{1:n}) = \frac{1}{n}\sum_{i=1}^n \left\{ f_i(\phi_i, \lambda_i) + \langle \mu_i, \lambda_i - \lambda_0 \rangle + \frac{1}{2\eta}\|\lambda_i - \lambda_0\|^2 \right\}.$$
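The augmented Lagrangian is straightforward to evaluate. A minimal sketch, again using hypothetical quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$ (at consensus, both penalty terms vanish and the plain objective is recovered, for any multipliers):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 4, 2, 0.5
a = rng.normal(size=(n, d))  # defines the toy f_i(lam) = 0.5||lam - a_i||^2

def aug_lagrangian(lam_local, lam0, mu):
    """Scaled augmented Lagrangian, averaged over the n local terms."""
    total = 0.0
    for i in range(n):
        fi = 0.5 * np.sum((lam_local[i] - a[i]) ** 2)
        lin = mu[i] @ (lam_local[i] - lam0)                      # <mu_i, lam_i - lam_0>
        quad = np.sum((lam_local[i] - lam0) ** 2) / (2.0 * eta)  # quadratic penalty
        total += fi + lin + quad
    return total / n

lam0 = np.zeros(d)
# At consensus (lam_i = lam_0) the penalty terms vanish for any mu, so the
# augmented Lagrangian reduces to the plain finite-sum objective.
L_cons = aug_lagrangian(np.tile(lam0, (n, 1)), lam0, rng.normal(size=(n, d)))
f_plain = np.mean([0.5 * np.sum((lam0 - a[i]) ** 2) for i in range(n)])
assert np.isclose(L_cons, f_plain)
```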

The optimization seeks a saddle point,

$$\min_{\phi,\lambda_i}\max_{\mu} \mathcal{L}(\phi, \lambda_i, \mu),$$

and lends itself to alternating primal–dual updates in which only a subset of the local parameters is updated per iteration, making the method well suited to large-scale data regimes.

3. Primal–Dual VI and Block-Preconditioned VI Algorithms

At each iteration, a subset $S_t$ of data indices is sampled for batch updates. For each $i\in S_t$ (in parallel), the following local subproblem is solved (Oracle I):

$$(\phi_i^t, \lambda_i^t) = \arg\min_{\phi_i, \lambda_i} f_i(\phi_i, \lambda_i) + \langle \mu_i^{t-1}, \lambda_i - \lambda_0^{t-1} \rangle + \frac{1}{2\eta}\|\lambda_i - \lambda_0^{t-1}\|^2,$$

$$\mu_i^t = \mu_i^{t-1} + \frac{1}{\eta} (\lambda_i^t - \lambda_0^{t-1}).$$

Auxiliary variables accumulate parameter increments, and the global parameter is updated via:

$$h^t = h^{t-1} + \frac{1}{n}\sum_{i\in S_t}\left( \lambda_i^t - \lambda_0^{t-1} \right),\qquad \lambda_0^t = \frac{1}{|S_t|} \sum_{i\in S_t} \lambda_i^t + h^t.$$

For heterogeneous parameter blocks, PD-VI is extended to P²D-VI. The parameter $\lambda$ is partitioned into $B$ blocks with block-specific penalties $\eta_j$ and block-wise preconditioner $D_\eta = \mathrm{diag}(\eta_1^{-1} I_{d_1},\dots,\eta_B^{-1} I_{d_B})$. The penalty term becomes $\frac{1}{2}\|\lambda_i - \lambda_0\|_{D_\eta}^2$ and the dual update $\mu_i \leftarrow \mu_i + D_\eta(\lambda_i - \lambda_0)$. Oracle II is:

$$(\phi_i^t,\lambda_i^t) = \arg\min_{\phi_i,\lambda_i} f_i(\phi_i,\lambda_i) + \langle\mu_i^{t-1}, \lambda_i-\lambda_0^{t-1}\rangle + \frac{1}{2}\|\lambda_i-\lambda_0^{t-1}\|_{D_\eta}^2,$$

$$\mu_i^t = \mu_i^{t-1} + D_\eta (\lambda_i^t - \lambda_0^{t-1}).$$

Block-specific penalties $\eta_j$ are chosen as $\eta_j \propto 1/L_{\lambda, j}$, where $L_{\lambda, j}$ denotes the Lipschitz constant associated with block $j$.
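A small illustration of this rule, with assumed (made-up) blockwise Lipschitz constants:

```python
import numpy as np

# Assumed blockwise Lipschitz constants L_{lambda,j} (illustrative values).
L_blocks = np.array([0.5, 4.0, 16.0])
c = 1.0                           # proportionality constant (a tuning choice)
eta_blocks = c / L_blocks         # eta_j proportional to 1 / L_{lambda,j}
d_eta_weights = 1.0 / eta_blocks  # diagonal weights of D_eta, one per block
# Stiffer blocks (larger L_j) receive smaller eta_j, i.e. a heavier penalty
# pulling their local copies more tightly toward lam_0.
assert np.all(np.diff(eta_blocks) < 0)
```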

Mini-Batch Primal–Dual VI (PD-VI) Pseudocode

Initialize λ_0, η, T
λ_i ← λ_0, μ_i ← 0, h ← 0 for all i
for t in 1..T:
    Sample mini-batch S_t of size m
    for i in S_t (in parallel):
        (φ_i^t, λ_i^t) = argmin_{φ_i,λ_i} f_i(φ_i,λ_i) + <μ_i^{t-1}, λ_i - λ_0^{t-1}> + 1/(2η) ||λ_i - λ_0^{t-1}||^2
        μ_i^t = μ_i^{t-1} + (1/η)(λ_i^t - λ_0^{t-1})
    for i not in S_t:
        φ_i^t = φ_i^{t-1}, λ_i^t = λ_i^{t-1}, μ_i^t = μ_i^{t-1}
    h^t = h^{t-1} + (1/n) Σ_{i∈S_t} (λ_i^t - λ_0^{t-1})
    λ_0^t = (1/m) Σ_{i∈S_t} λ_i^t + h^t
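The loop above can be instantiated on toy quadratic surrogates $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$, for which Oracle I has a closed-form minimizer. This is a hedged sketch under that toy setup, not the paper's implementation ($\phi_i$ is omitted and the closed form is specific to the quadratic choice):

```python
import numpy as np

# Toy PD-VI run with quadratic surrogates f_i(lam) = 0.5 * ||lam - a_i||^2.
# For this f_i, Oracle I has the closed form
#   lam_i = (eta * (a_i - mu_i) + lam_0) / (eta + 1).
rng = np.random.default_rng(42)
n, d, eta, T = 8, 3, 1.0, 200
a = rng.normal(size=(n, d))

lam0 = np.zeros(d)
lam = np.tile(lam0, (n, 1))
mu = np.zeros((n, d))
h = np.zeros(d)

for t in range(T):
    S = np.arange(n)          # full batch here; a true mini-batch would subsample
    lam0_old = lam0.copy()
    for i in S:
        # Oracle I in closed form: minimizer of the local subproblem
        lam[i] = (eta * (a[i] - mu[i]) + lam0_old) / (eta + 1.0)
        mu[i] += (lam[i] - lam0_old) / eta        # dual ascent step
    h += (lam[S] - lam0_old).sum(axis=0) / n      # increment accumulator
    lam0 = lam[S].mean(axis=0) + h                # global update

# For these quadratics the consensus optimum is the mean of the a_i,
# and all local copies agree with lam_0 at convergence.
assert np.allclose(lam0, a.mean(axis=0), atol=1e-6)
assert np.allclose(lam, lam0, atol=1e-6)
```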

Block-Preconditioned Oracle (P²D-VI) Pseudocode

Input: f_i, λ_0, μ_i, block-diagonal D_η
(φ_i^t, λ_i^t) = argmin_{φ_i,λ_i} f_i(φ_i,λ_i) + <μ_i, λ_i - λ_0> + 0.5 ||λ_i - λ_0||_{D_η}^2
μ_i^t = μ_i + D_η (λ_i^t - λ_0)
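For the same toy quadratic $f_i(\lambda) = \tfrac{1}{2}\|\lambda - a_i\|^2$, Oracle II also has an elementwise closed form once $D_\eta$ is stored as a vector of per-coordinate weights (block sizes and penalty values below are illustrative assumptions):

```python
import numpy as np

# Oracle II on a toy quadratic f_i(lam) = 0.5 * ||lam - a_i||^2.
# Storing D_eta as a per-coordinate weight vector d_eta, the local
# minimizer is elementwise: lam_i = (a_i - mu_i + d_eta*lam_0) / (1 + d_eta).
rng = np.random.default_rng(7)
d1, d2 = 2, 3                                      # two hypothetical blocks
d_eta = np.concatenate([np.full(d1, 1.0 / 0.5),    # eta_1 = 0.5
                        np.full(d2, 1.0 / 2.0)])   # eta_2 = 2.0
a_i = rng.normal(size=d1 + d2)
lam0 = np.zeros(d1 + d2)
mu_i = np.zeros(d1 + d2)

lam_i = (a_i - mu_i + d_eta * lam0) / (1.0 + d_eta)  # Oracle II, closed form
mu_new = mu_i + d_eta * (lam_i - lam0)               # preconditioned dual step

# Stationarity check: the gradient of the local objective vanishes at lam_i.
grad = (lam_i - a_i) + mu_i + d_eta * (lam_i - lam0)
assert np.allclose(grad, 0.0, atol=1e-12)
```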

Both algorithms employ constant step sizes, enable parallel parameter updates, and adapt to geometric differences in parameter blocks.

4. Convergence Properties

PD-VI and P²D-VI achieve provable convergence rates under mild smoothness assumptions, without relying on conjugacy or explicit variance control. For nonconvex $f_i$ that are strongly convex in $\phi_i$ and step size $\eta \leq O(\omega/(L(1+\kappa)))$ (with $\omega = m/n$ and $\kappa = L^2/\mu^2$), PD-VI yields

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\big\| \nabla_\lambda f(\phi^{t-1}, \lambda_0^{t-1}) \big\|^2 = O(1/T).$$

For convex $f_i$, the expected objective gap of the averaged iterates is $O(1/T)$, and for strongly convex $f_i$, the weighted iterate gap contracts at rate $O(1/r^T)$ with $r=1+\mu\eta$. For P²D-VI, letting $S = \sum_j \eta_j^2 L_j^2$ and choosing $\{\eta_j\}$ such that $S \leq C$ for a constant $C$, the same $O(1/T)$ guarantee holds in the block-diagonal norm

$$\|v\|^2_H = \sum_j \eta_j \|v_j\|^2,$$

enabling blockwise weighted descent (Lyu et al., 7 Feb 2026).

5. Empirical Evaluation

Empirical validation encompasses large-scale synthetic and real-world datasets. On synthetic Gaussian mixtures (100,000 points, 10 dimensions, 5 clusters) with biased mini-batches, both PD-VI and P²D-VI demonstrate faster convergence and reach lower Wasserstein distance to the true mixture than SVI, SGD, Adam, and CV-based methods. For spatial transcriptomics (MOSTA dataset: ≈150,000 spatial spots, 20,000 genes) using a non-conjugate Potts-augmented mixture model, PD-VI and especially P²D-VI achieve lower ELBO values, smaller gradient norms, and higher adjusted Rand index (ARI) in fewer iterations compared to SVI, RMSProp, and Adam. Domain clustering maps produced by these methods exhibit sharper and more anatomically coherent results, and the block-preconditioned extension confers further performance improvements.

6. Methodological Significance and Context

PD-VI and its block-preconditioned variant allow for scalable and robust variational inference in non-conjugate and high-dimensional settings. By employing primal-dual strategies native to constrained finite-sum optimization, the methodology simultaneously updates local and global variational parameters within a mini-batch framework without diminishing theoretical guarantees. The incorporation of block-adaptive penalties via P²D-VI offers improved adaptation to curvature heterogeneity, overcoming limitations of isotropic penalty approaches typical of classical SVI updates. This methodology is particularly advantageous where parameter block geometry varies or loss landscapes are highly anisotropic (Lyu et al., 7 Feb 2026).

A central distinction of PD-VI lies in its formulation as an augmented Lagrangian saddle-point problem, which contrasts with classical mean-field variational inference strategies (e.g., coordinate ascent VI, standard SVI) that rely on explicit ELBO maximization and often require either closed-form updates or careful tuning for stochastic optimization. PD-VI's independence from conjugacy and bounded-gradient-variance assumptions substantially broadens its applicability. The use of primal-dual updates and block-diagonal preconditioning situates the method in the broader context of modern stochastic optimization with consensus constraints and blockwise adaptive learning rates. This suggests that PD-VI and P²D-VI may serve as templates for scalable posterior approximation in latent variable models across diverse scientific domains.


Reference: "Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization" (Lyu et al., 7 Feb 2026).
