Statistical Comparisons of Classifiers by Generalized Stochastic Dominance
Abstract: Despite its central importance for the development of machine learning algorithms, there is still no consensus on how to compare classifiers over multiple data sets with respect to several quality criteria. Every comparison framework faces (at least) three fundamental challenges: the multiplicity of quality criteria, the multiplicity of data sets, and the randomness of the selection of data sets. In this paper, we add a fresh perspective to the lively debate by adopting recent developments from decision theory. Building on so-called preference systems, our framework ranks classifiers by a generalized concept of stochastic dominance, which elegantly circumvents the cumbersome, and often even self-contradictory, reliance on aggregates. We further show that generalized stochastic dominance can be operationalized by solving easy-to-handle linear programs, and that it can be statistically tested with an adapted two-sample observation-randomization test. Together, this yields a powerful framework for the statistical comparison of classifiers over multiple data sets with respect to multiple quality criteria simultaneously. We illustrate and investigate our framework in a simulation study and on a set of standard benchmark data sets.
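To make the testing idea concrete, the following is a minimal sketch of a two-sample observation-randomization (permutation) test for comparing two classifiers across data sets. It is a simplified classical analogue of the paper's adapted test, not the paper's exact procedure: it uses the mean paired difference in accuracy as the test statistic, and the accuracy values are purely illustrative.

```python
import random

# Hypothetical per-data-set accuracies for two classifiers; these
# numbers are illustrative, not taken from the paper's experiments.
acc_a = [0.91, 0.85, 0.78, 0.88, 0.93, 0.80]
acc_b = [0.89, 0.82, 0.80, 0.84, 0.90, 0.79]

def randomization_test(x, y, n_perm=10_000, seed=0):
    """Two-sided observation-randomization (permutation) test on the
    mean paired difference: under the null hypothesis that the two
    classifiers are exchangeable, the sign of each per-data-set
    difference can be flipped uniformly at random."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each paired difference and
        # recompute the statistic under the null.
        resampled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(resampled) / len(resampled)) >= observed:
            hits += 1
    return hits / n_perm  # Monte Carlo estimate of the p-value

p_value = randomization_test(acc_a, acc_b)
print(f"p-value: {p_value:.3f}")
```

The paper's framework additionally handles multiple quality criteria at once via preference systems, with the dominance check itself reduced to linear programs; the sketch above only conveys the randomization principle underlying the test.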