
On the Convergence of DP-SGD with Adaptive Clipping

Published 27 Dec 2024 in cs.LG, cs.CR, math.OC, and stat.ML | (2412.19916v1)

Abstract: Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization. Although prior works extensively investigated clipping with a constant threshold, private training remains highly sensitive to threshold selection, which can be expensive or even infeasible to tune. This sensitivity motivates the development of adaptive approaches, such as quantile clipping, which have demonstrated empirical success but lack a solid theoretical understanding. This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD). We demonstrate that QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but show how this can be mitigated through a carefully designed quantile and step size schedule. Our analysis reveals crucial relationships between quantile selection, step size, and convergence behavior, providing practical guidelines for parameter selection. We extend these results to differentially private optimization, establishing the first theoretical guarantees for DP-QC-SGD. Our findings provide theoretical foundations for the widely used adaptive clipping heuristic and highlight open avenues for future research.

Summary

  • The paper introduces QC-SGD, demonstrating its convergence behavior in nonconvex settings while identifying an inherent bias due to fixed quantile schedules.
  • It shows that a joint schedule of increasing quantiles and decreasing step sizes can eliminate bias and achieve optimal convergence rates.
  • The study extends the analysis to DP-QC-SGD, integrating Gaussian noise with adaptive clipping to provide the first theoretical guarantees for private learning.

Convergence Analysis of DP-SGD with Adaptive Quantile Clipping

Motivation and Background

Differential privacy (DP) in stochastic optimization critically relies on controlling the sensitivity of gradient updates, typically achieved through gradient clipping. The standard DP-SGD algorithm enforces a fixed threshold for gradient clipping, which introduces an additional hyperparameter (the clipping norm) that must be carefully tuned. Selecting this threshold is challenging and costly in privacy terms, since the tuning process itself leaks privacy. To address this, adaptive clipping strategies such as quantile clipping have emerged, where the clipping norm is dynamically set to a chosen quantile of the observed gradient norm distribution. While empirical evidence suggests strong practical utility, especially in federated and private learning contexts, robust theoretical guarantees for adaptive clipping have been missing.

Problem Formulation and Assumptions

The paper investigates the convergence behavior of stochastic gradient descent (SGD) with quantile clipping (QC-SGD) in nonconvex settings. The objective is to minimize the expected loss $f(x) = \mathbb{E}_{\xi \sim D}[f_{\xi}(x)]$ over a parameter vector $x \in \mathbb{R}^d$, assuming (i) unbiased stochastic gradients with bounded $q$-th moments for $q \in (1,2]$, generalizing the classical bounded-variance assumption to heavy-tailed noise, and (ii) $L$-smoothness and lower-boundedness of $f$. At each iteration, the per-sample stochastic gradient is adaptively clipped to the $p$-th quantile of the gradient norm distribution.

QC-SGD Algorithm and Bias Analysis

The update rule for QC-SGD is $x^{t+1} = x^{t} - \gamma_t g^t$, where $g^t = \alpha_{\xi^t}(x^t)\, \nabla f_{\xi^t}(x^t)$ and $\alpha_{\xi}(x) = \min\{1, \frac{\tau(x)}{\|\nabla f_\xi(x)\|}\}$, with $\tau(x)$ set to the empirical $p$-th quantile of $\|\nabla f_\xi(x)\|$. The analysis demonstrates that, as with constant-threshold clipping, QC-SGD introduces an irreducible bias: under a fixed quantile schedule, the gradient estimator fails to be unbiased, preventing exact convergence to stationary points. This bias is formally quantified and related to the choice of $p$ and the step size $\gamma_t$.
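The update rule above can be sketched in a few lines of NumPy. This is an illustrative, non-private implementation of one QC-SGD step under the stated definitions; the function name and constants are ours, not the paper's.

```python
import numpy as np

def qc_sgd_step(x, per_sample_grads, p=0.9, step_size=0.1):
    """One QC-SGD step: clip each per-sample gradient to the p-th
    empirical quantile of the gradient norms, then average and descend.

    per_sample_grads: array of shape (batch_size, dim).
    """
    norms = np.linalg.norm(per_sample_grads, axis=1)
    tau = np.quantile(norms, p)  # adaptive clipping threshold tau(x)
    # alpha_xi(x) = min(1, tau / ||grad f_xi(x)||), per sample
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    g = (per_sample_grads * scale[:, None]).mean(axis=0)
    return x - step_size * g
```

Note that every clipped per-sample gradient has norm at most $\tau$, so the averaged update has bounded sensitivity, which is what later enables the DP variant.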

For constant parameters, the expected gradient norm $\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x^t)\|$ can only be guaranteed to converge to a neighborhood of zero, whose size is determined by the clipping aggressiveness (lower $p$ yields larger neighborhoods). No static setting can remove this bias, regardless of step size, in contrast to standard SGD or non-private clipped SGD, where the bias may be eliminated by increasing the threshold.

Time-Varying and Adaptive Schedules

The analysis extends to time-varying quantile ($p_t$) and step size ($\gamma_t$) schedules. By jointly increasing the quantile and decreasing the step size appropriately (e.g., $\gamma_t = \tilde{O}(t^{\theta-1})$, $p_t = 1 - \tilde{O}(t^{-\nu})$), the bias induced by quantile clipping is eventually eliminated, allowing convergence to stationary points. For bounded variance ($q=2$), the optimal complexity is achieved with $\gamma_t = \tilde{O}(t^{-2/3})$ and $p_t$ increasing as $1 - \tilde{O}(t^{-1/3})$, yielding a convergence rate of $\tilde{O}(T^{-1/3})$. This result formalizes guidelines for adaptive schedules that reduce clipping bias.
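As a rough illustration of such a joint schedule, the sketch below instantiates the $q=2$ rates with placeholder constants ($\gamma_0$, the floor `p_min`, and the `+1` offset are our choices, not values from the paper); it only captures the asymptotic shape.

```python
def schedules(t, theta=1/3, nu=1/3, gamma0=0.5, p_min=0.5):
    """Illustrative time-varying schedules for QC-SGD.

    gamma_t ~ t^(theta - 1) decays (t^{-2/3} for theta = 1/3);
    p_t = 1 - t^{-nu} increases toward 1, floored at p_min so the
    quantile is non-degenerate at small t.
    """
    gamma_t = gamma0 * (t + 1) ** (theta - 1)   # decreasing step size
    p_t = max(p_min, 1.0 - (t + 1) ** (-nu))    # quantile tending to 1
    return gamma_t, p_t
```

The key qualitative property is that clipping becomes progressively less aggressive ($p_t \to 1$) exactly as the step size shrinks, which is how the clipping bias is driven to zero.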

Comparison with Fixed Clipping and Implications for DP-SGD

In comparison with constant clipping, the analysis underscores that raising clipping thresholds over time can, in theory, eliminate bias but is problematic for privacy-preserving settings since higher thresholds necessitate addition of larger DP noise per iteration, degrading utility. Adaptive clipping is thus theoretically preferable for private settings, offering utility comparable to carefully tuned constant norms but requiring no hyperparameter search.

Differentially Private QC-SGD: Theory and Guarantees

The DP-QC-SGD algorithm adapts QC-SGD for private learning by adding Gaussian noise with variance proportional to the dynamic clipping threshold. The derived convergence bound shows that DP-QC-SGD converges to a neighborhood of a stationary point, with the neighborhood size and learning rate jointly influenced by the quantile schedule, the DP noise multiplier, and the mini-batch size. Interestingly, mini-batching reduces the variance of the stochastic gradient but does not mitigate the bias from clipping. The analysis provides the first theoretical guarantee for private adaptive clipping, aligning with empirical observations that adaptive clipping retains privacy while maintaining utility.
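A minimal sketch of one such step is below, assuming the batch quantile is available exactly (as the theory does; a real deployment would estimate it privately, which this sketch deliberately omits). The noise scale $\sigma \tau / n$ mirrors the standard DP-SGD recipe where sensitivity is the clipping threshold divided by the batch size; the function name and parameters are ours.

```python
import numpy as np

def dp_qc_sgd_step(x, per_sample_grads, p, step_size, noise_multiplier, rng):
    """One DP-QC-SGD step (sketch): quantile-clip per-sample gradients,
    average, then add Gaussian noise whose std scales with the dynamic
    threshold tau, since tau bounds the per-sample sensitivity."""
    n = per_sample_grads.shape[0]
    norms = np.linalg.norm(per_sample_grads, axis=1)
    tau = np.quantile(norms, p)                       # dynamic threshold
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    g = (per_sample_grads * scale[:, None]).mean(axis=0)
    noise = rng.normal(0.0, noise_multiplier * tau / n, size=x.shape)
    return x - step_size * (g + noise)
```

This makes the trade-off in the text concrete: a larger threshold $\tau$ means less clipping bias but proportionally larger injected noise, which is why adaptive (rather than growing) thresholds are attractive privately.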

Practical Implications and Future Directions

The main implication is that adaptive quantile clipping, when combined with appropriate learning rate schedules, provides a theoretically sound and practical approach for private optimization, reducing the sensitivity to hyperparameter tuning and improving robustness against outliers and heavy-tailed noise. However, the theoretical results assume access to exact quantile estimates, which may be infeasible in decentralized or cross-device federated learning applications. Real-world implementations must use approximate quantile computations, introducing an additional layer of analysis. Empirical reports of suboptimal adaptive clipping on certain datasets also motivate further investigation.

Conclusion

This work delivers a comprehensive convergence analysis for SGD with quantile clipping, bridging the theoretical gap for adaptive clipping heuristics in private optimization. The findings clarify the interplay between quantile selection, learning rate scheduling, and convergence behavior, and establish the first theoretical guarantees for DP-QC-SGD. The theoretical bias limitation of fixed quantile schedules is highlighted, and mitigation strategies via time-varying schedules are proposed. Future directions include refining quantile estimation procedures, extending the analysis to heterogeneous data/federated settings, and exploring bias-reducing algorithmic modifications for robust and efficient private learning.


Reference: "On the Convergence of DP-SGD with Adaptive Clipping" (2412.19916)
