
How to Boost Any Loss Function

Published 2 Jul 2024 in cs.LG and stat.ML (arXiv:2407.02279v2)

Abstract: Boosting is a highly successful ML-born optimization setting in which one is required to computationally efficiently learn arbitrarily good models based on access to a weak learner oracle, providing classifiers performing at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not require access to first order information about a loss, yet the decades-long history of boosting has quickly evolved it into a first order optimization setting -- sometimes even wrongfully defining it as such. Owing to recent progress extending gradient-based optimization to use only a loss' zeroth ($0^{th}$) order information to learn, this begs the question: what loss functions can be efficiently optimized with boosting and what is the information really needed for boosting to meet the original boosting blueprint's requirements? We provide a constructive formal answer essentially showing that any loss function can be optimized with boosting and thus boosting can achieve a feat not yet known to be possible in the classical $0^{th}$ order setting, since loss functions are not required to be convex, nor differentiable or Lipschitz -- and in fact not required to be continuous either. Some tools we use are rooted in quantum calculus, the mathematical field -- not to be confounded with quantum computation -- that studies calculus without passing to the limit, and thus without using first order information.


Summary

  • The paper's main contribution is introducing secantboost, a boosting framework that optimizes non-convex and discontinuous loss functions using only zero-order information.
  • It details an algorithm that leverages v-derivatives and Bregman secant distortions to compute adaptive coefficients and update model weights without derivative data.
  • Convergence guarantees and competitive boosting rates are established, and empirical results further validate the approach, broadening boosting's applicability to complex machine learning problems.

Boosting with Zero Order Information

The paper presents novel insights into the optimization of ML models through boosting without relying on first-order derivative information. This contrasts with the traditional understanding of boosting, which has evolved to lean heavily on first-order optimization techniques. The authors offer a comprehensive framework demonstrating that boosting can be carried out effectively using only zero-order information, that is, loss function values, with no derivative access.

Core Contributions

The paper's core contribution is establishing that any loss function can be optimized using boosting without requiring it to be convex, differentiable, Lipschitz, or even continuous. The authors achieve this by leveraging concepts from quantum calculus, thereby enabling zeroth-order optimization. This broader applicability to non-convex, non-differentiable, and even discontinuous loss functions extends the utility of boosting considerably.
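
To ground the terminology, the display below recalls the secant-type derivative at the heart of quantum calculus (the h-derivative in Kac and Cheung's sense) and a Bregman-style distortion built from it. The notation is an illustrative assumption modeled on these sources; the paper's own definitions of the v-derivative and the Bregman secant distortion may differ in detail.

```latex
% Secant (finite-difference) derivative: the chord slope of F between z and z+v,
% defined for any offset v \neq 0; no differentiability or continuity of F is needed.
\[
  D_v F(z) \;=\; \frac{F(z+v) - F(z)}{v}.
\]
% A Bregman-style secant distortion (assumed form, modeled on the Bregman chord
% divergence): the classical Bregman divergence with F'(z') replaced by D_v F(z').
\[
  B_F^{v}(z \,\|\, z') \;=\; F(z) - F(z') - (z - z')\, D_v F(z').
\]
```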

Theoretical Foundations

The authors formulate a boosting algorithm, aptly named secantboost, which relies on "v-derivatives" and "Bregman secant distortions." The approach avoids classical derivatives, instead using secant lines in place of gradients. The procedure unfolds as follows (a schematic code sketch is given after the list):

  1. Initialization: The procedure initializes an ensemble model H_0 and weights derived from the v-derivative of the loss function.
  2. Weak Learner: At each iteration, a weak learner is employed to generate a classifier using re-weighted training examples.
  3. Computing Leveraging Coefficient: An adaptive leveraging coefficient α_t is computed, relying not on derivative information but on secants of the loss function values.
  4. Model Update: The ensemble model is updated, and an offset oracle is introduced to determine the weight adjustments using secant lines.
  5. Weight Update: Weights are updated based on the v-derivative, ensuring that no first-order information is utilized.
  6. Stopping Criteria: Early stopping is integrated to terminate the algorithm if all weights converge to zero.
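
To make the loop concrete, the following is a minimal Python sketch of the procedure above. The interfaces (weak_learner, offset_oracle), the edge-proportional choice of α_t, and the exact weight formula are illustrative assumptions rather than the paper's pseudocode; the point being demonstrated is only that every quantity comes from loss evaluations, never from a derivative.

```python
import numpy as np

def secant_slope(loss, z, v):
    """Secant (v-)derivative: chord slope of `loss` between z and z + v (v != 0).
    Uses only loss values, so `loss` may be non-convex, non-differentiable,
    or even discontinuous."""
    return (loss(z + v) - loss(z)) / v

def secantboost(X, y, loss, weak_learner, offset_oracle, T, eta=1.0):
    """Schematic zeroth-order boosting loop following the steps listed above.

    Assumed interfaces (illustrative, not the paper's):
      loss(z)                  -> scalar loss of a margin value z
      weak_learner(X, y, dist) -> classifier h with h(X) in {-1, +1}^m
      offset_oracle(z)         -> nonzero offset v at which to take a secant
    """
    m = X.shape[0]
    H = np.zeros(m)                                   # 1. initialize ensemble predictions
    margins = y * H
    v = np.array([offset_oracle(z) for z in margins])
    w = np.array([-secant_slope(loss, margins[i], v[i]) for i in range(m)])

    ensemble = []                                     # list of (alpha_t, h_t)
    for _ in range(T):
        if np.all(w == 0):                            # 6. early stopping: all weights vanish
            break
        dist = np.abs(w) / np.abs(w).sum()
        h = weak_learner(X, y, dist)                  # 2. call the weak learner
        preds = h(X)
        # 3. leveraging coefficient from loss values only (no gradients);
        #    this edge-proportional choice stands in for the paper's exact rule.
        edge = float(np.sum(w * y * preds)) / float(np.sum(np.abs(w)))
        alpha = eta * edge
        ensemble.append((alpha, h))
        H += alpha * preds                            # 4. model update; oracle gives new offsets
        margins = y * H
        v = np.array([offset_oracle(z) for z in margins])
        # 5. weight update from the v-derivative: first-order information never used.
        w = np.array([-secant_slope(loss, margins[i], v[i]) for i in range(m)])
    return ensemble
```

In this sketch the weights play the role that negative gradients play in gradient boosting, but they are computed as secant slopes, so discontinuous losses pose no definitional problem.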

Numerical Results

The authors substantiate their theoretical claims with empirical results that validate the efficacy of secantboost. Significant points include:

  • Convergence Proof: The paper presents a rigorous proof demonstrating the convergence of secantboost to a local minimum, given a sufficiently large number of iterations.
  • Boosting Rate: It shows that boosting can achieve competitive rates without requiring gradients, provided the weak learning assumption (γ-WLA) holds; a standard form of this assumption is recalled below.
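
For reference, here is a standard empirical statement of the weak learning assumption from the boosting literature, which the paper's γ-WLA adapts to its setting; the exact formulation in the paper may differ.

```latex
% Standard empirical gamma-weak learning assumption: at each boosting round t,
% the weak learner returns h_t whose edge on the current distribution w_t
% over the m training examples is at least gamma > 0.
\[
  \sum_{i=1}^{m} w_{t,i}\, y_i\, h_t(x_i) \;\ge\; \gamma,
  \qquad w_{t,i} \ge 0, \quad \sum_{i=1}^{m} w_{t,i} = 1, \quad
  y_i \in \{-1,+1\}, \quad h_t(x_i) \in [-1,+1].
\]
```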

Implications and Future Directions

  • Broad Applicability: By eliminating the need for first-order information, this boosting framework can be applied to a wider variety of loss functions and ML problems, including those with complex, non-smooth objective landscapes.
  • Offset Oracle: The introduction of the offset oracle is particularly noteworthy. It lets the algorithm adapt to the shape of the loss surface, ensuring that weight updates remain feasible even with discontinuous losses (see the illustrative sketch after this list).
  • Alternative to Gradient Descent: This work situates boosting as a viable alternative to gradient descent-based methods for certain classes of problems, particularly when derivative information is either unavailable or expensive to compute.
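
As an illustration of how an offset oracle might be realized, the hypothetical sketch below (the factory make_offset_oracle, its parameters, and the shrinking strategy are all assumptions, not the paper's construction) returns a nonzero offset at which the loss can actually be evaluated, so a secant is always available even near jumps of a discontinuous loss:

```python
import math

def make_offset_oracle(loss, v0=1.0, shrink=0.5, max_tries=50):
    """Hypothetical offset oracle: given a point z, return a nonzero offset v
    such that loss(z + v) is finite, so the secant slope (loss(z+v)-loss(z))/v
    is well defined. Shrinks the candidate offset geometrically if a probe fails."""
    def oracle(z):
        v = v0
        for _ in range(max_tries):
            if math.isfinite(loss(z + v)):
                return v
            v *= shrink            # try a smaller step near jumps or undefined regions
        return v                   # fall back to the last (smallest) candidate
    return oracle

# Usage with a discontinuous, 0/1-style step loss: the oracle still returns
# an offset permitting a secant evaluation at any point.
step_loss = lambda z: 0.0 if z > 0 else 1.0
oracle = make_offset_oracle(step_loss)
v = oracle(0.3)
slope = (step_loss(0.3 + v) - step_loss(0.3)) / v
```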

Speculations on Future AI Developments

The repercussions of this approach might influence future AI developments in several ways:

  1. Enhanced Robustness: Models optimized using secantboost could exhibit greater robustness to noisy and non-differentiable objective functions, making them more suitable for real-world applications where such issues are prevalent.
  2. Deployment in Non-Euclidean Spaces: The principles could be extended to boosting in non-Euclidean spaces, where traditional gradient techniques are less effective.
  3. Utility in Black-Box Models: The method's reliance on function evaluations rather than gradients aligns well with optimization in black-box settings, expanding its utility in complex and high-dimensional search spaces.

Conclusion

Overall, this paper presents a significant advancement by illustrating how boosting can be effectively implemented using only zero-order information. It opens new avenues for boosting algorithms to be applied more broadly and robustly, thus enriching the toolkit for ML practitioners. Future research might explore further refinements in the offset oracle's implementation, enhanced ways to leverage weak learners in zero-order settings, and application to new domains and loss functions.
