
On the Convergence of DP-SGD with Adaptive Clipping

Published 27 Dec 2024 in cs.LG, cs.CR, math.OC, and stat.ML | (2412.19916v1)

Abstract: Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization. Although prior works extensively investigated clipping with a constant threshold, private training remains highly sensitive to threshold selection, which can be expensive or even infeasible to tune. This sensitivity motivates the development of adaptive approaches, such as quantile clipping, which have demonstrated empirical success but lack a solid theoretical understanding. This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD). We demonstrate that QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but show how this can be mitigated through a carefully designed quantile and step size schedule. Our analysis reveals crucial relationships between quantile selection, step size, and convergence behavior, providing practical guidelines for parameter selection. We extend these results to differentially private optimization, establishing the first theoretical guarantees for DP-QC-SGD. Our findings provide theoretical foundations for the widely used adaptive clipping heuristic and highlight open avenues for future research.

Summary

  • The paper introduces QC-SGD, demonstrating its convergence behavior in nonconvex settings while identifying an inherent bias due to fixed quantile schedules.
  • It shows that a joint schedule of increasing quantiles and decreasing step sizes can eliminate bias and achieve optimal convergence rates.
  • The study extends the analysis to DP-QC-SGD, integrating Gaussian noise with adaptive clipping to provide the first theoretical guarantees for private learning.

Convergence Analysis of DP-SGD with Adaptive Quantile Clipping

Motivation and Background

Differential privacy (DP) in stochastic optimization critically relies on controlling the sensitivity of gradient updates, typically achieved through gradient clipping. The standard DP-SGD algorithm enforces a fixed threshold for gradient clipping, which introduces an additional hyperparameter (the clipping norm) that must be carefully tuned. Selecting this threshold is challenging and costly in privacy terms, since the tuning process itself leaks privacy. To address this, adaptive clipping strategies such as quantile clipping have emerged, where the clipping norm is dynamically set to a chosen quantile of the observed gradient norm distribution. While empirical evidence suggests strong practical utility, especially in federated and private learning contexts, robust theoretical guarantees for adaptive clipping have been missing.

Problem Formulation and Assumptions

The paper investigates the convergence behavior of stochastic gradient descent (SGD) with quantile clipping (QC-SGD) in nonconvex settings. The objective is to minimize the expected loss $f(x) = \mathbb{E}_{\xi \sim D}[f_{\xi}(x)]$ over a parameter vector $x \in \mathbb{R}^d$, assuming (i) unbiased stochastic gradients with bounded $q$-th moments for $q \in (1,2]$, generalizing the classical bounded-variance assumption to heavy-tailed noise, and (ii) $L$-smoothness and lower-boundedness of $f$. At each iteration, the per-sample stochastic gradient is adaptively clipped to the $p$-th quantile of the gradient norm distribution.

QC-SGD Algorithm and Bias Analysis

The update rule for QC-SGD is $x^{t+1} = x^{t} - \gamma_t g^t$, where $g^t = \alpha_{\xi^t}(x^t)\, \nabla f_{\xi^t}(x^t)$ and $\alpha_{\xi}(x) = \min\{1, \frac{\tau(x)}{\|\nabla f_\xi(x)\|}\}$, with $\tau(x)$ set to the empirical $p$-th quantile of $\|\nabla f_\xi(x)\|$. The analysis demonstrates that, as with constant-threshold clipping, QC-SGD introduces an irreducible bias: under a fixed quantile schedule, the gradient estimator fails to be unbiased, preventing exact convergence to stationary points. This bias is formally quantified and related to the choice of $p$ and the step size $\gamma_t$.
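The update rule above can be sketched in a few lines of NumPy. This is an illustrative, non-private implementation of one QC-SGD step under the stated definitions; the function name and constants are ours, not the paper's.

```python
import numpy as np

def qc_sgd_step(x, per_sample_grads, p=0.9, step_size=0.1):
    """One QC-SGD step: clip each per-sample gradient to the p-th
    empirical quantile of the gradient norms, then average and descend.

    per_sample_grads: array of shape (batch_size, dim).
    """
    norms = np.linalg.norm(per_sample_grads, axis=1)
    tau = np.quantile(norms, p)  # adaptive clipping threshold tau(x)
    # alpha_xi(x) = min(1, tau / ||grad f_xi(x)||), per sample
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    g = (per_sample_grads * scale[:, None]).mean(axis=0)
    return x - step_size * g
```

Note that every clipped per-sample gradient has norm at most $\tau$, so the averaged update has bounded sensitivity, which is what later enables the DP variant.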

For constant parameters, the expected gradient norm $\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x^t)\|$ can only be guaranteed to converge to a neighborhood of zero, whose size is determined by the clipping aggressiveness (lower $p$ yields larger neighborhoods). No static setting can remove this bias, regardless of step size, in contrast to standard SGD or non-private clipped SGD, where the bias may be eliminated by increasing the threshold.

Time-Varying and Adaptive Schedules

The analysis extends to time-varying quantile ($p_t$) and step size ($\gamma_t$) schedules. By jointly increasing the quantile and decreasing the step size appropriately (e.g., $\gamma_t = \tilde{O}(t^{\theta-1})$, $p_t = 1 - \tilde{O}(t^{-\nu})$), the bias induced by quantile clipping is eventually eliminated, allowing convergence to stationary points. For bounded variance ($q=2$), the optimal complexity is achieved with $\gamma_t = \tilde{O}(t^{-2/3})$ and $p_t$ increasing as $1 - \tilde{O}(t^{-1/3})$, yielding a convergence rate of $\tilde{O}(T^{-1/3})$. This result formalizes guidelines for adaptive schedules that reduce clipping bias.
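As a rough illustration of such a joint schedule, the sketch below instantiates the $q=2$ rates with placeholder constants ($\gamma_0$, the floor `p_min`, and the `+1` offset are our choices, not values from the paper); it only captures the asymptotic shape.

```python
def schedules(t, theta=1/3, nu=1/3, gamma0=0.5, p_min=0.5):
    """Illustrative time-varying schedules for QC-SGD.

    gamma_t ~ t^(theta - 1) decays (t^{-2/3} for theta = 1/3);
    p_t = 1 - t^{-nu} increases toward 1, floored at p_min so the
    quantile is non-degenerate at small t.
    """
    gamma_t = gamma0 * (t + 1) ** (theta - 1)   # decreasing step size
    p_t = max(p_min, 1.0 - (t + 1) ** (-nu))    # quantile tending to 1
    return gamma_t, p_t
```

The key qualitative property is that clipping becomes progressively less aggressive ($p_t \to 1$) exactly as the step size shrinks, which is how the clipping bias is driven to zero.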

Comparison with Fixed Clipping and Implications for DP-SGD

In comparison with constant clipping, the analysis underscores that raising clipping thresholds over time can, in theory, eliminate bias but is problematic for privacy-preserving settings since higher thresholds necessitate addition of larger DP noise per iteration, degrading utility. Adaptive clipping is thus theoretically preferable for private settings, offering utility comparable to carefully tuned constant norms but requiring no hyperparameter search.

Differentially Private QC-SGD: Theory and Guarantees

The DP-QC-SGD algorithm adapts QC-SGD for private learning by adding Gaussian noise with variance proportional to the dynamic clipping threshold. The derived convergence bound shows that DP-QC-SGD converges to a neighborhood of a stationary point, with the neighborhood size and learning rate jointly influenced by the quantile schedule, the DP noise multiplier, and the mini-batch size. Interestingly, mini-batching reduces the variance of the stochastic gradient but does not mitigate the bias from clipping. The analysis provides the first theoretical guarantee for private adaptive clipping, aligning with empirical observations that adaptive clipping retains privacy while maintaining utility.
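A minimal sketch of one such step is below, assuming the batch quantile is available exactly (as the theory does; a real deployment would estimate it privately, which this sketch deliberately omits). The noise scale $\sigma \tau / n$ mirrors the standard DP-SGD recipe where sensitivity is the clipping threshold divided by the batch size; the function name and parameters are ours.

```python
import numpy as np

def dp_qc_sgd_step(x, per_sample_grads, p, step_size, noise_multiplier, rng):
    """One DP-QC-SGD step (sketch): quantile-clip per-sample gradients,
    average, then add Gaussian noise whose std scales with the dynamic
    threshold tau, since tau bounds the per-sample sensitivity."""
    n = per_sample_grads.shape[0]
    norms = np.linalg.norm(per_sample_grads, axis=1)
    tau = np.quantile(norms, p)                       # dynamic threshold
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    g = (per_sample_grads * scale[:, None]).mean(axis=0)
    noise = rng.normal(0.0, noise_multiplier * tau / n, size=x.shape)
    return x - step_size * (g + noise)
```

This makes the trade-off in the text concrete: a larger threshold $\tau$ means less clipping bias but proportionally larger injected noise, which is why adaptive (rather than growing) thresholds are attractive privately.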

Practical Implications and Future Directions

The main implication is that adaptive quantile clipping, when combined with appropriate learning rate schedules, provides a theoretically sound and practical approach for private optimization, reducing the sensitivity to hyperparameter tuning and improving robustness against outliers and heavy-tailed noise. However, the theoretical results assume access to exact quantile estimates, which may be infeasible in decentralized or cross-device federated learning applications. Real-world implementations must use approximate quantile computations, introducing an additional layer of analysis. Empirical reports of suboptimal adaptive clipping on certain datasets also motivate further investigation.

Conclusion

This work delivers a comprehensive convergence analysis for SGD with quantile clipping, bridging the theoretical gap for adaptive clipping heuristics in private optimization. The findings clarify the interplay between quantile selection, learning rate scheduling, and convergence behavior, and establish the first theoretical guarantees for DP-QC-SGD. The theoretical bias limitation of fixed quantile schedules is highlighted, and mitigation strategies via time-varying schedules are proposed. Future directions include refining quantile estimation procedures, extending the analysis to heterogeneous data/federated settings, and exploring bias-reducing algorithmic modifications for robust and efficient private learning.


Reference: "On the Convergence of DP-SGD with Adaptive Clipping" (2412.19916)
