Frank–Wolfe Momentum Optimization
- FW with momentum is a projection-free optimization method that integrates momentum mechanisms to enhance convergence and reduce oscillations in constrained settings.
- It leverages variants like heavy ball and Nesterov-style acceleration with adaptable step-sizes, ensuring robust performance in high-dimensional and distributed applications.
- Empirical studies confirm its practical advantages in areas such as CNN pruning and matrix completion, achieving improved test loss, sparsity, and convergence rates.
Frank–Wolfe (FW) with momentum refers to a family of projection-free first-order optimization methods that integrate momentum mechanisms into the classical FW (also known as conditional gradient) algorithm. The principal motivation is to accelerate convergence, reduce oscillations, and improve empirical efficiency for smooth convex and some nonconvex problems constrained to compact convex sets. Multiple momentum variants have been systematically developed, including heavy ball momentum and Nesterov-style accelerated momentum, each yielding distinct theoretical and practical characteristics. FW with momentum is now a mature area of research with established algorithmic templates, primal–dual performance guarantees, tailorings for high-dimensional machine learning tasks, and adaptations to distributed and stochastic optimization regimes.
1. Algorithmic Foundations of FW with Momentum
In the standard FW framework, at each iteration a linear minimization oracle (LMO) selects a search direction by solving $s_k = \arg\min_{s \in \mathcal{X}} \langle \nabla f(x_k), s \rangle$, followed by the convex update $x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k$ with a step-size $\gamma_k \in [0, 1]$. The approach avoids costly projections and is well-suited for large-scale constrained optimization where $\mathcal{X}$ (e.g., $\ell_1$-, $\ell_2$-, or nuclear-norm balls) admits an efficient LMO.
FW with momentum generalizes this by replacing the current gradient with an aggregate (momentum) term. In the heavy ball FW (HFW) variant, the update is:

$$g_k = (1 - \delta_k)\, g_{k-1} + \delta_k \nabla f(x_k),$$

where $g_k$ acts as a "momentum-averaged" gradient, and $\delta_k \in (0, 1]$ is the momentum parameter. The subsequent LMO and update steps use $g_k$ in place of $\nabla f(x_k)$. Several choices for $\delta_k$ are documented, most notably the parameter-free rule $\delta_k = \frac{2}{k+2}$, smoothness-adapted step-size rules, and line-search options (Li et al., 2021).
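As a concrete sketch, the momentum-averaged direction can be dropped into a NumPy implementation of FW over the $\ell_1$-ball, whose LMO returns a signed coordinate vertex. The rules $\delta_k = \gamma_k = 2/(k+2)$ and the toy least-squares instance below are illustrative choices, not prescribed by the cited papers:

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """LMO over the l1-ball: the minimizing vertex is -radius * sign(g_i) * e_i
    at the coordinate of largest |gradient|."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def hfw(grad_f, x0, radius=1.0, iters=500):
    """Heavy-ball FW: the LMO sees a momentum-averaged gradient g_k
    instead of the raw gradient."""
    x = x0.copy()
    g = np.zeros_like(x0)
    for k in range(iters):
        delta = 2.0 / (k + 2)          # parameter-free momentum weight
        gamma = 2.0 / (k + 2)          # classical FW step-size
        g = (1 - delta) * g + delta * grad_f(x)
        s = lmo_l1(g, radius)
        x = (1 - gamma) * x + gamma * s
    return x

# toy problem: min 0.5 * ||Ax - b||^2 over the unit l1-ball
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = 3.0 * rng.standard_normal(30)
x_star = hfw(lambda x: A.T @ (A @ x - b), np.zeros(10))
```

Because every iterate is a convex combination of feasible points, the iterates stay inside the $\ell_1$-ball without any projection step.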
Nesterov-style momentum for FW is constructed by maintaining a running average over past gradients evaluated at auxiliary points $y_k$, yielding "Accelerated Frank–Wolfe" (AFW) updates:

$$g_k = (1 - \delta_k)\, g_{k-1} + \delta_k \nabla f(y_k), \qquad s_k = \arg\min_{s \in \mathcal{X}} \langle g_k, s \rangle, \qquad x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k,$$

where the auxiliary points $y_k$ are formed as suitable convex combinations of the iterates and atoms (Li et al., 2020).
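One plausible instantiation of this template is sketched below; the auxiliary-point rule $y_k = (1-\gamma_k)x_k + \gamma_k s_{k-1}$ and the weights $\delta_k = \gamma_k = 2/(k+2)$ are illustrative assumptions, not the exact schedule of Li et al. (2020):

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """LMO over the l1-ball: best vertex against the given direction."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def afw(grad_f, x0, radius=1.0, iters=500):
    """Nesterov-style FW sketch: gradients are averaged at auxiliary
    points y_k rather than at the iterates x_k themselves."""
    x = x0.copy()
    s_prev = np.zeros_like(x0)
    g = np.zeros_like(x0)
    for k in range(iters):
        gamma = 2.0 / (k + 2)
        delta = 2.0 / (k + 2)
        y = (1 - gamma) * x + gamma * s_prev      # auxiliary point (assumed rule)
        g = (1 - delta) * g + delta * grad_f(y)   # running gradient average
        s = lmo_l1(g, radius)
        x = (1 - gamma) * x + gamma * s
        s_prev = s
    return x

# toy least-squares over the unit l1-ball
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = 3.0 * rng.standard_normal(30)
x_star = afw(lambda x: A.T @ (A @ x - b), np.zeros(10))
```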
Distributed momentum-based FW (DMFW) algorithms, such as those deployed in networked systems, operate via local momentum-filtered stochastic gradients, consensus averaging, and gradient tracking to approximate a network-wide gradient direction (Hou et al., 2022).
2. Convergence Properties and Theoretical Guarantees
FW with momentum exhibits convergence rates closely related to both the structure of $\mathcal{X}$ and the specific momentum integration. The key results include:
- Heavy Ball FW (HFW) (Li et al., 2021): For convex $f$ with $L$-Lipschitz gradient and compact $\mathcal{X}$ (diameter $D$), HFW with the weighted choice $\delta_k = \frac{2}{k+2}$ exhibits a primal–dual gap satisfying $\mathcal{G}_k = O(LD^2/k)$ for all $k$. This implies an $O(1/k)$ rate for both the gap and suboptimality.
- Accelerated FW (AFW) (Li et al., 2020): On general convex $f$, AFW achieves the classical $O(1/k)$ rate. However, for certain sets (e.g., an active $\ell_1$-ball or $\ell_2$-balls with a nondegeneracy condition), AFW provably accelerates to $O(1/k^2)$, matching fast rates of projected accelerated methods but using only linear minimization oracles.
- FW with Momentum for CNN Pruning (Shili et al., 30 Nov 2025): When an exponential moving average of gradients drives the LMO, practical convergence (test loss, sparsity) matches or exceeds vanilla FW, without worsening the worst-case theoretical primal–dual gap bounds.
- Distributed Momentum-Based FW (Hou et al., 2022): In distributed stochastic settings, DMFW with suitably diminishing step-size and momentum sequences achieves an $O(1/\sqrt{k})$ rate in convex cases, matching the centralized stochastic momentum-FW rate, and comparable stationarity-gap rates in nonconvex cases.
A summary table of theoretical rates for prominent variants is as follows:
| Method | General Convex Rate | Special Set Rate | Comments |
|---|---|---|---|
| Standard FW | $O(1/k)$ | — | Projection-free |
| HFW (Li et al., 2021) | $O(1/k)$ | — | Tightens constant |
| AFW (Li et al., 2020) | $O(1/k)$ | $O(1/k^2)$ | Requires structure |
| DMFW (Hou et al., 2022) | $O(1/\sqrt{k})$ | — | Distributed setting |
3. Step Size, Momentum Parameter Tuning, and Restart
Efficient deployment of FW with momentum hinges on proper selection of both step-size and momentum parameters, tailored to the underlying problem geometry and smoothness:
- In HFW (Li et al., 2021), the recommended parameter-free rule $\delta_k = \frac{2}{k+2}$ is universal and ensures optimal convergence without requiring smoothness constants.
- Smoothness-adapted step-sizes can further stabilize updates, though they require estimates of the smoothness constant $L$.
- Line-search for $\gamma_k$ often yields improved empirical performance when the extra per-iteration cost is justified.
- In distributed scenarios (Hou et al., 2022), suitably diminishing step-size and momentum sequences are compatible with consensus and variance-reduction mechanisms.
A restart scheme can be layered atop HFW, exploiting the primal–dual gap as a trigger. When the standard FW gap falls below the generalized (momentum-based) gap, the stage is restarted, which tightens the denominator in the error bound from $k + 2$ to $\bar{k} + 2$, where $\bar{k}$ accumulates total past iteration counts. This decreases the leading constant without altering the asymptotic $O(1/k)$ rate.
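A minimal sketch of such a gap-triggered restart layered on the heavy-ball iteration is shown below. The trigger comparison follows the description above, but the minimum stage length and the other constants are illustrative safeguards, not the paper's exact bookkeeping:

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """LMO over the l1-ball: best vertex against the given direction."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def hfw_restart(grad_f, x0, radius=1.0, iters=500, min_stage=10):
    """Heavy-ball FW with a gap-triggered restart: when the standard FW
    gap falls below the momentum-based gap, the momentum buffer and the
    stage counter are reset and a new stage begins."""
    x = x0.copy()
    g = np.zeros_like(x0)
    k = 0                                          # stage-local counter
    for _ in range(iters):
        raw = grad_f(x)
        g = (1 - 2.0 / (k + 2)) * g + 2.0 / (k + 2) * raw
        s = lmo_l1(g, radius)
        fw_gap = raw @ (x - lmo_l1(raw, radius))   # standard FW gap
        mom_gap = g @ (x - s)                      # generalized (momentum) gap
        if fw_gap < mom_gap and k >= min_stage:    # restart trigger
            g, k, s = raw.copy(), 0, lmo_l1(raw, radius)
        x = (1 - 2.0 / (k + 2)) * x + 2.0 / (k + 2) * s
        k += 1
    return x

# toy least-squares over the unit l1-ball
rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
b = 3.0 * rng.standard_normal(30)
x_star = hfw_restart(lambda x: A.T @ (A @ x - b), np.zeros(10))
```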
4. Empirical Performance and Practical Applications
Extensive numerical studies have evaluated FW with momentum across classic and modern application domains:
- Classical regression and matrix completion (Li et al., 2021, Li et al., 2020): HFW with parameter-free momentum dramatically accelerates the reduction in test loss and primal–dual gap across $\ell_1$-, $\ell_2$-, and $n$-support-norm ball constraints, as well as nuclear-norm-constrained matrix completion. AFW exhibits much reduced oscillatory behavior and sharper rate transitions for "active" ball constraints, as predicted by theory.
- CNN pruning (Shili et al., 30 Nov 2025): FW with momentum, when used for pruning convolutional neural networks on datasets such as MNIST, yields pruned models that are sparser (≈40–45% nonzero weights vs. ≈50% for baseline pruning) and display equal or higher test accuracy, even with significantly less pre-training. The method stabilizes the pruning direction and supports efficient layerwise sparsification.
- Distributed optimization (Hou et al., 2022): In network settings, DMFW advances both convex and nonconvex objectives under consensus constraints, matching or outperforming state-of-the-art projected or stochastic FW baselines in data efficiency and convergence per iteration.
Empirical findings consistently indicate that FW with momentum (either heavy ball or Nesterov-style), especially with parameter-free settings, yields the fastest decay of primal–dual error and improved robustness in high-dimensional or resource-limited environments.
5. Variants, Special Cases, and Comparative Analysis
Several flavors of momentum-enhanced FW have been proposed and benchmarked:
- Heavy Ball FW (HFW): Aggregates gradients by exponential weighting; minimal memory requirements; offers tight $O(1/k)$ primal–dual gap bounds without extra LMO calls (Li et al., 2021).
- Accelerated FW (AFW): Uses Nesterov-style averaging of supporting hyperplanes, resulting in potential acceleration for sets with unique, smooth LMO minimizers (Li et al., 2020).
- Uniform-Average FW (UFW): Employs simple uniform averaging; empirically slower than weighted-momentum variants.
- FW with Exponential Moving Average (FW+Momentum) for Structured Pruning (Shili et al., 30 Nov 2025): Empirically demonstrates improved sparsity–accuracy tradeoffs with negligible computational overhead and reduced dense pre-training requirements compared to greedy and vanilla FW methods.
- Distributed Momentum FW (DMFW) (Hou et al., 2022): Merges local momentum filtering, gradient tracking, and consensus to solve global objectives across networked agents.
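A loose sketch of the DMFW ingredients follows; the ring-network mixing matrix, the simplified momentum-plus-tracking update, and the quadratic local losses are illustrative assumptions, not the exact protocol of Hou et al. (2022):

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    """LMO over the l1-ball: best vertex against the given direction."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def dmfw(local_grads, W, d, radius=1.0, iters=500):
    """Distributed momentum FW sketch: agents mix iterates and
    momentum-filtered gradient estimates with neighbors via a doubly
    stochastic matrix W, then each takes a local FW step."""
    n = len(local_grads)
    X = np.zeros((n, d))                           # one iterate per agent
    G = np.stack([g(X[i]) for i, g in enumerate(local_grads)])
    for k in range(iters):
        gamma = 2.0 / (k + 2)
        delta = 2.0 / (k + 2)
        X = W @ X                                  # consensus on iterates
        G_loc = np.stack([g(X[i]) for i, g in enumerate(local_grads)])
        G = W @ ((1 - delta) * G + delta * G_loc)  # momentum filter + mixing
        S = np.stack([lmo_l1(G[i], radius) for i in range(n)])
        X = (1 - gamma) * X + gamma * S            # local FW steps
    return X.mean(axis=0)

# 4 agents on a ring, each holding one block of a least-squares problem
rng = np.random.default_rng(2)
n, d = 4, 8
As = [rng.standard_normal((10, d)) for _ in range(n)]
bs = [3.0 * rng.standard_normal(10) for _ in range(n)]
local_grads = [lambda x, A=A_, b=b_: A.T @ (A @ x - b) for A_, b_ in zip(As, bs)]
W = 0.5 * np.eye(n) + 0.25 * (np.roll(np.eye(n), 1, 0) + np.roll(np.eye(n), -1, 0))
x_avg = dmfw(local_grads, W, d)
```

Because the mixing matrix is row-stochastic and every local step is a convex combination, each agent's iterate (and hence their average) remains in the $\ell_1$-ball throughout.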
A plausible implication is that the practical utility of each momentum variant is context-specific. For example, AFW realizes significant gains when the feasible set geometry ensures continuity and uniqueness of the LMO solution, while HFW robustly improves constants in the general setting. In high-throughput regimes, computational simplicity and memory constraints may favor HFW or FW+momentum designs.
6. Limitations and Open Directions
Notwithstanding their established advantages in key settings, FW methods with momentum are subject to several limitations:
- Non-improvability of rates in the general case: The $O(1/k)$ rate for FW-type algorithms cannot be improved on general compact convex domains, regardless of momentum integration, as established in both theoretical analysis and empirical studies (Li et al., 2020, Li et al., 2021).
- Acceleration on special feasible sets: True acceleration occurs only for sets (e.g., "active" $\ell_1$- or $\ell_2$-balls) where the LMO delivers unique, nondegenerate minimizers; this is not a universal guarantee for all constraint families (Li et al., 2020).
- Parameter tuning and restart overhead: Adaptive or smoothness-based steps may require knowledge of $L$ and additional computations. Restart mechanisms, while improving bounds, can require additional LMO calls per iteration (Li et al., 2021).
- Distributed setting challenges: The convergence rates for distributed momentum FW degrade to $O(1/\sqrt{k})$ in the networked stochastic regime (Hou et al., 2022), which, while comparable to centralized alternatives, are not on par with the deterministic case.
Further research is directed toward expanding acceleration guarantees to broader constraint classes, optimizing per-iteration computational cost, and quantifying robustness under stochastic data and asynchrony. Applications in large-scale sparse learning, low-rank recovery, and resource-constrained neural network pruning continue to drive interest in FW with momentum.