High Probability Guarantees for Random Reshuffling

Published 20 Nov 2023 in math.OC and cs.LG (arXiv:2311.11841v3)

Abstract: We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for smooth nonconvex optimization problems. $\mathsf{RR}$ is widely used in practice, notably for training neural networks. In this work, we provide high probability first-order and second-order complexity guarantees for this method. First, we establish a high probability first-order sample complexity result for driving the Euclidean norm of the gradient (without taking expectation) below $\varepsilon$. The derived complexity matches the best existing in-expectation one up to a logarithmic term, without imposing additional assumptions or changing $\mathsf{RR}$'s update rule. We then propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after finitely many iterations, which enables us to prove a high probability first-order complexity guarantee for the last iterate. Second, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that adds a randomized perturbation procedure near stationary points. We show that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and establish a high probability second-order complexity result, without requiring any sub-Gaussian tail-type assumptions on the stochastic gradient errors. The fundamental ingredient in deriving these results is a new concentration property for sampling without replacement in $\mathsf{RR}$, which may be of independent interest. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
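The $\mathsf{RR}$ scheme described in the abstract — one pass per epoch over a fresh uniform permutation of the $n$ component gradients — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the stopping test below (normalized per-epoch displacement) and the uniform perturbation are simplified stand-ins for the paper's $\mathsf{RR}$-$\mathsf{sc}$ criterion and $\mathsf{p}$-$\mathsf{RR}$ perturbation procedure, and all function names and parameter values are hypothetical.

```python
import numpy as np

def rr(grad_i, x0, n, step=0.001, epochs=3000, eps=1e-4,
       perturb_radius=0.0, rng=None):
    """Random reshuffling (RR) sketch: each epoch draws a fresh uniform
    permutation (sampling without replacement) and applies all n component
    gradient steps once. The stopping test is an illustrative proxy, not
    the paper's exact RR-sc criterion."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        x_start = x.copy()
        for i in rng.permutation(n):      # one without-replacement pass
            x = x - step * grad_i(i, x)
        # computable near-stationarity proxy: the epoch displacement
        # divided by (step * n) roughly tracks the mean gradient norm
        if np.linalg.norm(x - x_start) / (step * n) < eps:
            if perturb_radius > 0.0:
                # p-RR-style escape attempt: random perturbation near a
                # (possibly strict-saddle) stationary point
                x = x + rng.uniform(-perturb_radius, perturb_radius, x.shape)
            else:
                break                     # RR-sc-style: stop at the last iterate
    return x

# Toy least-squares instance: f_i(x) = 0.5 * (a_i @ x - b_i)**2
rng = np.random.default_rng(1)
A, b = rng.normal(size=(20, 3)), rng.normal(size=20)
x_hat = rr(lambda i, x: A[i] * (A[i] @ x - b[i]), np.zeros(3), n=20)
```

With a constant stepsize, the iterates settle in a small neighborhood of a stationary point, so the proxy above eventually falls below any fixed tolerance; choosing `perturb_radius > 0` turns the stop into a restart-with-perturbation, mimicking the escape mechanism of $\mathsf{p}$-$\mathsf{RR}$.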

Citations (2)


Authors (2)
