Asymptotics of resampling without replacement in robust and logistic regression
Abstract: This paper studies the asymptotics of resampling without replacement in the proportional regime where the dimension $p$ and the sample size $n$ are of the same order. For a given dataset $(X,y)\in \mathbb{R}^{n\times p}\times \mathbb{R}^n$ and a fixed subsample ratio $q\in(0,1)$, the practitioner samples, independently of $(X,y)$, iid subsets $I_1,\dots,I_M$ of $\{1,\dots,n\}$ of size $qn$ and trains estimators $\hat{\beta}(I_1),\dots,\hat{\beta}(I_M)$ on the corresponding subsets of rows of $(X, y)$. Understanding the performance of the bagged estimate $\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M} \hat{\beta}(I_m)$, for instance its squared error, requires understanding the correlations between two distinct estimates $\hat{\beta}(I_m)$ and $\hat{\beta}(I_{m'})$ trained on different subsets $I_m$ and $I_{m'}$. In robust linear regression and logistic regression, we characterize the limit in probability of the correlation between two estimates trained on different subsets of the data. The limit is characterized as the unique solution of a simple nonlinear equation. We further provide data-driven estimators that are consistent for this limit. These estimators of the limiting correlation allow us to estimate the squared error of the bagged estimate $\bar{\beta}$ and, for instance, to perform parameter tuning to choose the optimal subsample ratio $q$. As a by-product of the proof argument, we obtain the limiting distribution of the bivariate pair $(x_i^T \hat{\beta}(I_m), x_i^T \hat{\beta}(I_{m'}))$ for observations $i\in I_m\cap I_{m'}$, i.e., for observations used to train both estimates.
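The resampling scheme described in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the paper's method: it draws $M$ subsets of size $qn$ without replacement, fits an unregularized logistic regression on each subset (here by plain gradient descent, an assumption made for self-containedness), forms the bagged average $\bar\beta$, and computes the empirical correlation between two estimates trained on different subsets. All variable names and the Gaussian design are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, M = 200, 50, 0.7, 5  # proportional regime: p/n of constant order

# synthetic logistic data with Gaussian design (illustrative assumption)
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_star))).astype(float)

def fit_logistic(X, y, steps=500, lr=0.5):
    """Unpenalized logistic regression via gradient descent (illustrative solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (p_hat - y) / len(y)
    return beta

# M iid subsets of {1,...,n} of size qn, drawn without replacement
k = int(q * n)
subsets = [rng.choice(n, size=k, replace=False) for _ in range(M)]
betas = np.array([fit_logistic(X[I], y[I]) for I in subsets])

# bagged estimate and empirical correlation between two distinct estimates
beta_bar = betas.mean(axis=0)
corr = betas[0] @ betas[1] / (np.linalg.norm(betas[0]) * np.linalg.norm(betas[1]))
```

The paper's contribution is to characterize the in-probability limit of a correlation of this kind as $n, p \to \infty$ proportionally, and to provide consistent data-driven estimators of that limit, so that the squared error of `beta_bar` can be estimated and $q$ tuned without refitting over many subsample ratios.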