
MuSGD: Multi-Target Sampling Gradient Descent

Updated 26 January 2026
  • MuSGD is a sampling algorithm that employs composite Stein directions to simultaneously target multiple unnormalized distributions.
  • It integrates RKHS-based Stein gradients with a convex QP to determine optimal weightings, ensuring descent for all KL divergences.
  • Empirical results show that MuSGD outperforms traditional methods like MGDA and linear scalarization in accuracy and calibration on datasets such as CelebA and multi-MNIST.

MuSGD (Stochastic Multiple Target Sampling Gradient Descent) is a sampling algorithm designed to simultaneously handle multiple unnormalized target distributions. It extends Stein Variational Gradient Descent (SVGD) into the multi-target/multi-objective domain by iteratively updating a population of particles via composite Stein directions. MuSGD is primarily intended for probabilistic inference and multi-task learning, with theoretical guarantees and empirical benefits over classical approaches such as linear scalarization and multi-gradient descent algorithms (Phan et al., 2022).

1. Mathematical Formulation

Given K unnormalized target densities \{\pi_k(x)\}_{k=1}^K, with x \in \mathbb{R}^d, the goal is to construct a sequence of intermediate distributions q_0 \to q_1 \to \cdots \to q_L that progressively move closer to the joint high-density region of all \{\pi_k\}. Each step is achieved by a push-forward update:

q_{t+1} = T_t \# q_t, \qquad T_t(x) = x + \epsilon_t \phi_t(x),

where \phi_t(x) is a transport field constructed to minimize all KL divergences \mathrm{KL}(q \| \pi_1), \dots, \mathrm{KL}(q \| \pi_K) simultaneously, treating the task as a multi-objective problem over the space of densities. The optimization objective is:

\min_{q \in \mathcal{Q}} \big[\mathrm{KL}(q \| \pi_1), \dots, \mathrm{KL}(q \| \pi_K)\big]
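
A minimal sketch of the push-forward step at the particle level (NumPy assumed; `push_forward`, `phi_t`, and `eps_t` are illustrative names, and the transport field used here is a dummy placeholder until Section 2 specifies MuSGD's choice):

```python
import numpy as np

def push_forward(particles: np.ndarray, phi_t, eps_t: float) -> np.ndarray:
    """Apply T_t(x) = x + eps_t * phi_t(x) to every particle (one row per particle)."""
    return particles + eps_t * phi_t(particles)

# Dummy usage: a placeholder field that nudges particles toward the origin.
X = np.random.default_rng(0).normal(size=(5, 2))
X_next = push_forward(X, lambda X: -X, eps_t=0.1)
```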

2. Gradient Flow, Stein Directions, and Update Rule

Continuous-Time Gradient Flow

For each target k, define the Stein-variational direction in the reproducing kernel Hilbert space (RKHS) \mathcal{H}^d induced by a positive-definite kernel k(\cdot, \cdot):

\phi_k^*(x) = \mathbb{E}_{x' \sim q}\big[ k(x', x)\, \nabla_{x'} \log \pi_k(x') + \nabla_{x'} k(x', x) \big]
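
The expectation defining \phi_k^* can be estimated with the current particle set. Below is a minimal NumPy sketch under that assumption; the helper names (`rbf_kernel`, `stein_direction`, `score_k`) are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def rbf_kernel(X: np.ndarray, h: float):
    """Return Kmat[i, j] = exp(-||x_i - x_j||^2 / (2 h^2)) and
    dKmat[i, j] = grad_{x_i} k(x_i, x_j) for the RBF kernel."""
    diffs = X[:, None, :] - X[None, :, :]            # (n, n, d), diffs[i, j] = x_i - x_j
    Kmat = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * h ** 2))
    dKmat = -diffs / h ** 2 * Kmat[:, :, None]       # gradient w.r.t. the first argument
    return Kmat, dKmat

def stein_direction(X: np.ndarray, score_k, h: float) -> np.ndarray:
    """Monte Carlo estimate of phi_k^*(x_i), with the expectation over q
    replaced by an average over the particles in X (one row per particle)."""
    n = X.shape[0]
    Kmat, dKmat = rbf_kernel(X, h)
    scores = score_k(X)                              # (n, d), rows are grad_x log pi_k(x_j)
    drift = Kmat @ scores                            # sum_j k(x_j, x_i) * score(x_j)
    repulsion = dKmat.sum(axis=0)                    # sum_j grad_{x_j} k(x_j, x_i)
    return (drift + repulsion) / n
```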

To ensure descent for all KL objectives, MuSGD finds an optimal convex weight vector w^* = (w_1^*, \dots, w_K^*) on the simplex \Delta_K by solving the quadratic program:

w^* = \arg\min_{w \in \Delta_K} \Big\| \sum_{k=1}^K w_k \phi_k^* \Big\|_{\mathcal{H}^d}^2 = \arg\min_{w \in \Delta_K} w^\top M w, \qquad M_{k\ell} = \langle \phi_k^*, \phi_\ell^* \rangle_{\mathcal{H}^d}
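
For K = 2 the QP admits a closed form; this is the standard min-norm identity for two directions, stated here as an illustrative special case rather than an equation from the paper:

w_1^* = \operatorname{clip}\left( \frac{\langle \phi_2^* - \phi_1^*,\, \phi_2^* \rangle_{\mathcal{H}^d}}{\lVert \phi_1^* - \phi_2^* \rVert_{\mathcal{H}^d}^2},\ 0,\ 1 \right), \qquad w_2^* = 1 - w_1^*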

Construct the composite descent direction:

\phi^*(x) = \sum_{k=1}^K w_k^* \, \phi_k^*(x)

In the mean-field limit, the particle dynamics are governed by the deterministic flow:

\frac{d x(t)}{dt} = \phi^*_{q_t}(x(t)), \qquad \partial_t q_t = -\nabla \cdot \big(q_t \, \phi^*_{q_t}\big)

Discrete-Time MuSGD Update

Particles \{x_i^t\}_{i=1}^n represent the empirical distribution q_t = \frac{1}{n} \sum_{i=1}^n \delta_{x_i^t}. The discrete update executes:

  1. For all k \in \{1, \dots, K\} and i \in \{1, \dots, n\}, compute the empirical Stein direction

\hat{\phi}_k(x_i^t) = \frac{1}{n} \sum_{j=1}^n \big[ k(x_j^t, x_i^t)\, \nabla_{x_j^t} \log \pi_k(x_j^t) + \nabla_{x_j^t} k(x_j^t, x_i^t) \big]

  2. Build the Gram matrix \hat{M}_{k\ell} = \langle \hat{\phi}_k, \hat{\phi}_\ell \rangle using Monte Carlo inner products over the particles.
  3. Solve \hat{w}^* = \arg\min_{w \in \Delta_K} w^\top \hat{M} w (a small QP on the simplex).
  4. Compute the composite direction \hat{\phi}^* = \sum_{k=1}^K \hat{w}_k^* \hat{\phi}_k as above.
  5. Update each particle:

x_i^{t+1} = x_i^t + \epsilon_t\, \hat{\phi}^*(x_i^t)

Expanded form:

x_i^{t+1} = x_i^t + \frac{\epsilon_t}{n} \sum_{j=1}^n \Big[ k(x_j^t, x_i^t) \sum_{k=1}^K \hat{w}_k^* \, \nabla_{x_j^t} \log \pi_k(x_j^t) + \nabla_{x_j^t} k(x_j^t, x_i^t) \Big]
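
The following NumPy sketch assembles one full discrete iteration under the same assumptions as the snippet above (it reuses the illustrative `stein_direction` helper). The simplex QP is solved here with a simple Frank-Wolfe loop, which is one reasonable choice and not necessarily the solver used by the authors.

```python
import numpy as np

def simplex_qp_frank_wolfe(M: np.ndarray, iters: int = 200) -> np.ndarray:
    """Approximately solve min_{w in simplex} w^T M w with Frank-Wolfe."""
    K = M.shape[0]
    w = np.full(K, 1.0 / K)
    for t in range(iters):
        grad = 2.0 * M @ w
        s = np.zeros(K)
        s[np.argmin(grad)] = 1.0            # best simplex vertex for the linearized problem
        w = w + (2.0 / (t + 2.0)) * (s - w) # standard Frank-Wolfe step size
    return w

def musgd_step(X: np.ndarray, score_fns, h: float, eps: float) -> np.ndarray:
    """One update of all particles against the K targets given by `score_fns`."""
    n = X.shape[0]
    phis = np.stack([stein_direction(X, s, h) for s in score_fns])  # (K, n, d)
    M = np.einsum('knd,lnd->kl', phis, phis) / n   # Monte Carlo inner products <phi_k, phi_l>
    w = simplex_qp_frank_wolfe(M)                  # optimal convex weights on the simplex
    phi_star = np.tensordot(w, phis, axes=1)       # composite direction, shape (n, d)
    return X + eps * phi_star
```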

3. Theoretical Properties and Connections

In the limit of infinite RBF kernel bandwidth (or a single particle), the kernel repulsive term is eliminated and k(x, x') \to 1, yielding:

\phi^*(x) = \sum_{k=1}^K w_k^*\, \nabla_x \log \pi_k(x)

If \pi_k(x) \propto \exp(-L_k(x)) for task losses L_k, this matches the multi-gradient descent (MGDA) direction. The paper proves that as the number of particles reduces to one, the kernel bandwidth tends to infinity, and the step size tends to zero, MuSGD exactly recovers the classical MGDA update (Phan et al., 2022).
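
Explicitly, with the identification \pi_k(x) \propto \exp(-L_k(x)), the single-particle, constant-kernel update becomes (a standard rewriting, sketched for completeness):

\phi^*(x) = \sum_{k=1}^K w_k^*\, \nabla_x \log \pi_k(x) = -\sum_{k=1}^K w_k^*\, \nabla_x L_k(x), \qquad x \leftarrow x - \epsilon_t \sum_{k=1}^K w_k^*\, \nabla_x L_k(x),

i.e., the MGDA step on the task losses L_k.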

Under standard smoothness and Lipschitz continuity of \nabla \log \pi_k and of the kernel, both the continuous-time flow and the discrete MuSGD trajectories converge as t \to \infty. All KL divergences decrease at each iteration:

\mathrm{KL}(q_{t+1} \| \pi_k) \le \mathrm{KL}(q_t \| \pi_k), \qquad k = 1, \dots, K
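
The per-objective decrease follows from the standard Stein-variational derivative identity for push-forward maps combined with the first-order optimality of the min-norm weights; the following is a sketch of the standard argument, not a verbatim statement from the paper:

\frac{d}{d\epsilon}\, \mathrm{KL}\big( (\mathrm{id} + \epsilon \phi^*) \# q \,\big\|\, \pi_k \big) \Big|_{\epsilon = 0} = -\langle \phi^*,\, \phi_k^* \rangle_{\mathcal{H}^d} \le -\lVert \phi^* \rVert_{\mathcal{H}^d}^2 \le 0,

where the first inequality uses that the min-norm element \phi^* of the convex hull of \{\phi_k^*\} satisfies \langle \phi^*, \phi_k^* \rangle_{\mathcal{H}^d} \ge \lVert \phi^* \rVert_{\mathcal{H}^d}^2 for every k.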

4. Algorithmic Workflow and Computational Complexity

The standard MuSGD procedure repeats the five-step discrete update of Section 2 (per-target Stein directions, Gram matrix, simplex QP, composite direction, particle move) until a fixed iteration budget or a convergence criterion is met; a self-contained toy sketch is given after the complexity summary below. Complexity per iteration:

  • Kernel computations and Stein terms: O(K n^2 d)
  • Building the Gram matrix \hat{M}: O(K^2 n d)
  • QP on the simplex: O(K^3) (negligible for typical K)

Memory footprint includes the n particles (O(n d)), the K Stein fields evaluated at the n particle locations (O(K n d)), and the K \times K matrix \hat{M}.
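
As a sanity check on the workflow and costs above, the following self-contained toy script (illustrative only, not one of the paper's experiments) runs MuSGD-style updates against two unit-variance 2-D Gaussian targets, using the K = 2 closed-form weights noted in Section 2:

```python
import numpy as np

rng = np.random.default_rng(0)
means = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]        # two unit-variance Gaussian targets
scores = [lambda X, m=m: -(X - m) for m in means]            # grad_x log pi_k(x) = -(x - mu_k)

def stein_field(X, score, h=0.5):
    """Empirical Stein direction for one target, RBF kernel with bandwidth h."""
    diffs = X[:, None, :] - X[None, :, :]                    # diffs[i, j] = x_i - x_j
    Kmat = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))
    drift = Kmat @ score(X)                                  # sum_j k(x_j, x_i) * score(x_j)
    repulsion = (-diffs / h ** 2 * Kmat[..., None]).sum(axis=0)  # sum_j grad_{x_j} k(x_j, x_i)
    return (drift + repulsion) / X.shape[0]

X = rng.normal(scale=2.0, size=(100, 2))                     # initial particles
for _ in range(300):
    phi1, phi2 = stein_field(X, scores[0]), stein_field(X, scores[1])
    num = np.sum((phi2 - phi1) * phi2)                       # <phi_2 - phi_1, phi_2>
    den = np.sum((phi1 - phi2) ** 2) + 1e-12
    w1 = float(np.clip(num / den, 0.0, 1.0))                 # K = 2 closed-form simplex weight
    X = X + 0.1 * (w1 * phi1 + (1.0 - w1) * phi2)            # composite MuSGD-style step

print("particle mean:", X.mean(axis=0))  # particles should settle near the overlap of the two targets
```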

5. Empirical Performance and Applications

Sampling Accuracy

On synthetic problems (e.g., mixtures of Gaussians), MuSGD particles concentrate in the true joint high-density region, outperforming methods such as MOO-SVGD, whose particles scatter across separate modes.

Multi-Task Learning

MuSGD has been evaluated on multi-MNIST, multi-FashionMNIST, CelebA (10 attributes), and the SARCOS regression dataset. Using architectures such as LeNet or ResNet-18, MuSGD alternates particle-based updates of the shared parameters (via the composite multi-target Stein direction) with per-task SVGD updates of the task-specific parameters. Metrics considered are ensemble accuracy, Brier score, and Expected Calibration Error (ECE).

Extracted empirical results:

  • On CelebA, MuSGD attains highest mean accuracy (89.0% vs. 88.2% for MOO-SVGD) and lowest ECE (2.0% vs. 2.5%).
  • On SARCOS regression, MuSGD yields the lowest RMSE for all outputs (0.0428 vs. 0.0515 for MOO-SVGD).
  • MuSGD consistently outperforms single-task SGD, linear scalarization, MGDA, Pareto MTL, and MOO-SVGD in both accuracy (+1–2%) and calibration (lower ECE), across varied tasks (Phan et al., 2022).

6. Interpretation and Practical Guidelines

MuSGD generalizes SVGD to the multi-objective setting by:

  • Combining multiple Stein directions through simplex QP solving.
  • Retaining kernel-based repulsion for diversity among particles.
  • Guaranteeing, under the stated regularity conditions, simultaneous descent for all KL objectives.
  • Asymptotically coinciding with classical multi-gradient descent.
  • Demonstrating improved joint sampling and downstream multi-task generalization.

A plausible implication is that MuSGD represents the first kernelized variant of multi-gradient descent, yielding improved empirical sampling efficiency and predictive calibration in multi-objective scenarios. The algorithm is readily adaptable to modern deep learning architectures, provided access to the Stein gradients and kernel functions.

References (1)

  1. Phan et al. (2022). Stochastic Multiple Target Sampling Gradient Descent.
