Encoding categorical data: Is there yet anything 'hotter' than one-hot encoding?

Published 28 Dec 2023 in cs.LG | (2312.16930v1)

Abstract: Categorical features are present in about 40% of real-world problems, highlighting the crucial role of encoding as a preprocessing component. Some recent studies have reported benefits of various target-based encoders over classical target-agnostic approaches. However, these claims are not supported by any statistical analysis, and are based on a single dataset or a very small and heterogeneous sample of datasets. The present study explores the encoding effects in an exhaustive sample of classification problems from the OpenML repository. We fitted linear mixed-effects models to the experimental data, treating task ID as a random effect, and the encoding scheme and the various characteristics of categorical features as fixed effects. We found that in multiclass tasks, one-hot encoding and Helmert contrast coding outperform target-based encoders. In binary tasks, there were no significant differences across the encoding schemes; however, one-hot encoding demonstrated a marginally positive effect on the outcome. Importantly, we found no significant interactions between the encoding schemes and the characteristics of categorical features. This suggests that our findings are generalizable to a wide variety of problems across domains.
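
To make the three encoder families named in the abstract concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the paper's experimental code: the function names (`one_hot`, `helmert`, `target_mean`) and the toy data are hypothetical, the Helmert matrix follows one common convention (each level contrasted with the levels before it), and the plain per-level target mean stands in for the more regularized target-based encoders that such studies typically evaluate.

```python
from collections import defaultdict


def one_hot(values):
    """Target-agnostic: one indicator column per level (levels sorted)."""
    levels = sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    k = len(levels)
    return [[1 if index[v] == j else 0 for j in range(k)] for v in values]


def helmert(values):
    """Target-agnostic: k levels -> k-1 contrast columns.

    Assumed convention: column j holds -1 for levels 0..j, the value j+1
    for level j+1, and 0 for all later levels.
    """
    levels = sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    k = len(levels)

    def row(i):
        return [-1 if i <= j else (j + 1 if i == j + 1 else 0)
                for j in range(k - 1)]

    return [row(index[v]) for v in values]


def target_mean(values, targets):
    """Target-based (simplified): replace each level by its mean target.

    Fitting and applying on the same rows leaks the target; practical
    encoders add smoothing or cross-fitting, which this sketch omits.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


# Toy data (hypothetical): one categorical feature and a binary target.
colors = ["red", "green", "blue", "green", "red"]
y = [1, 0, 0, 1, 1]

oh = one_hot(colors)          # 3 columns, ordered blue, green, red
hc = helmert(colors)          # 2 contrast columns
te = target_mean(colors, y)   # 1 numeric column
```

Note the difference in output width: one-hot yields one column per level, Helmert coding yields one fewer, and the target-based encoding collapses the feature to a single numeric column regardless of cardinality.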
