- The paper proposes two novel algorithms, DP-IHT-H and DP-IHT-L, that ensure differential privacy for sparse regression under heavy-tailed response distributions.
- It leverages iterative hard thresholding and robust loss functions to achieve error bounds that adapt to the intrinsic tail properties of the data.
- Empirical results reveal that these methods outperform traditional DP algorithms, offering reliable and private solutions for high-dimensional, sensitive datasets.
Differentially Private Sparse Linear Regression with Heavy-tailed Responses: An Analytical Review
The paper investigates differentially private (DP) sparse linear regression under heavy-tailed response distributions. Traditional approaches to DP linear regression predominantly assume regular or light-tailed (e.g., sub-Gaussian) data; here, however, the authors examine a more general setting that accommodates heavy-tailed responses. This effort marks an important contribution both to high-dimensional machine learning and to the privacy preservation field.
Algorithmic Framework and Technical Contributions
To overcome the obstacles posed by heavy-tailed data, two main algorithms are introduced: DP-IHT-H and DP-IHT-L. Their fundamental purpose is to maintain privacy guarantees while providing robust estimation for models whose responses may only have finite (1+ζ)-th moments, a significant relaxation of the sub-Gaussian assumptions in existing work.
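To make the moment condition concrete, the following toy construction (my own illustration, not from the paper; the names `n`, `d`, and `s_star` mirror the paper's notation) generates a sparse regression instance whose noise has finite (1+ζ)-th moments only for small ζ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_star = 500, 100, 5

# Sparse ground-truth parameter (toy setup, not from the paper)
theta_star = np.zeros(d)
theta_star[:s_star] = 1.0

X = rng.standard_normal((n, d))
# Student-t noise with 1.5 degrees of freedom: E|e|^(1 + zeta) is finite
# only for zeta < 0.5, and the variance is infinite -- exactly the kind
# of response distribution a sub-Gaussian assumption rules out
y = X @ theta_star + rng.standard_t(df=1.5, size=n)
```

Light-tailed methods that clip or average such responses naively can be badly skewed by the occasional extreme draw, which is the failure mode the paper's robust losses are designed to avoid.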
- DP-IHT-H Algorithm: This method utilizes the Huber loss function to mitigate the effect of heavy-tailed outliers. Although the Huber loss itself is convex and smooth, the hard sparsity constraint renders the overall problem non-convex; the authors address this by combining iterative hard thresholding with a privacy-preserving "Peeling" selection step. This allows the algorithm to achieve error bounds that adapt to the tail-heaviness parameter ζ. Particularly notable is the error bound of
$\tilde{O}\left( (s^{*})^{\frac{1}{2}} \cdot \left(\frac{\log d}{n}\right)^{\frac{\zeta}{1 + \zeta}} + (s^{*})^{\frac{1 + 2\zeta}{2 + 2\zeta}} \cdot \left(\frac{\log^2 d}{n \varepsilon}\right)^{\frac{\zeta}{1 + \zeta}} \right),$
indicating that the algorithm suitably manages the trade-off between differential privacy constraints and robust analytics on heavy-tailed data.
- DP-IHT-L Algorithm: Aiming for consistent performance across all values of ζ, the paper introduces DP-IHT-L, which leverages the ℓ1 loss; its bounded (sub)gradients keep the sensitivity low even under heavy-tailed responses. Its key strength is that its guarantee does not depend on ζ, producing an error bound of
$\tilde{O}\left(\frac{(s^{*})^{3/2} \log d}{n \varepsilon}\right),$
which matches known results for sub-Gaussian cases under similar conditions, thus representing a significant improvement over DP-IHT-H in certain scenarios.
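The two update rules above can be sketched in a single toy loop. This is a non-private skeleton of my own (all names are illustrative): `noise_scale > 0` merely stands in for calibrated DP noise, whereas the paper's actual private "Peeling" selection is more involved.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude coordinates of v; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def iht_sketch(X, y, s, loss="huber", tau=1.0, eta=0.1, T=100,
               noise_scale=0.0, seed=0):
    """Toy iterative-hard-thresholding loop for the Huber and l1 losses.

    Not the paper's algorithm: the Gaussian perturbation below is only a
    placeholder for a properly calibrated DP mechanism.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(T):
        r = X @ theta - y
        if loss == "huber":
            # Huber gradient clips large residuals at tau, taming heavy tails
            psi = np.clip(r, -tau, tau)
        else:
            # l1 (sub)gradient is sign(r): bounded no matter how extreme the
            # response, which is why the l1 guarantee does not depend on zeta
            psi = np.sign(r)
        g = X.T @ psi / n + noise_scale * rng.standard_normal(d)
        theta = hard_threshold(theta - eta * g, s)
    return theta
```

Both variants produce s-sparse iterates by construction; they differ only in how the per-sample residual is bounded before it enters the gradient, which is the crux of the sensitivity analysis.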
Numerical Results and Insights
The comparative analysis reveals that both DP-IHT-H and DP-IHT-L significantly outperform conventional DP algorithms tailored to light-tailed data in scenarios involving heavy-tailed distributions, which are prevalent in domains such as finance and biomedicine. Moreover, the algorithms remain competitive with non-private baselines such as adaHuber, illustrating the balance between preserving privacy and maintaining estimation accuracy. Notably, DP-IHT-L stays stable as ζ varies, underscoring its robustness across tail regimes.
Practical and Theoretical Implications
The proposed methodologies present viable options for real-world applications where data privacy is critical, yet data distributions deviate from Gaussian norms. These advancements are vital for sensitive domains leveraging high-dimensional datasets, where privacy assurance aligns with regulatory compliance needs.
Speculations on Future Developments
The exploration detailed within the paper opens pathways for further inquiry. Future research might explore refining differential privacy for adaptive algorithms beyond linear regression models or enhancing scalability for even larger datasets. Moreover, integrating DP assurances with emerging generative models could offer new angles of exploration, merging privacy with enhanced data synthesis capabilities, thus expanding applicability across AI-driven sectors.
In conclusion, the paper provides a comprehensive approach to handling heavy-tailed data under differential privacy constraints, showcasing significant theoretical and practical advancements in sparse linear regression modeling. These concepts not only extend existing frameworks but also forecast potential evolutionary trajectories in privacy-sensitive machine learning models.