Encoding categorical data: Is there yet anything 'hotter' than one-hot encoding?
Abstract: Categorical features are present in about 40% of real-world problems, which makes encoding a crucial preprocessing step. Some recent studies have reported benefits of various target-based encoders over classical target-agnostic approaches. However, these claims are not supported by any statistical analysis and rest on a single dataset or on a very small and heterogeneous sample of datasets. The present study explores encoding effects in an exhaustive sample of classification problems from the OpenML repository. We fitted linear mixed-effects models to the experimental data, treating task ID as a random effect and the encoding scheme and the various characteristics of categorical features as fixed effects. We found that in multiclass tasks, one-hot encoding and Helmert contrast coding outperform target-based encoders. In binary tasks, there were no significant differences across the encoding schemes; however, one-hot encoding demonstrated a marginally positive effect on the outcome. Importantly, we found no significant interactions between the encoding schemes and the characteristics of categorical features. This suggests that our findings are generalizable to a wide variety of problems across domains.
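To make the distinction between target-agnostic and target-based encoders concrete, the following is a minimal sketch in plain Python. The function names, the toy `colors`/`labels` data, and the additive smoothing toward the global mean are illustrative assumptions; they are not taken from the study and do not reproduce the specific encoder implementations it evaluated.

```python
from collections import defaultdict


def one_hot_encode(values):
    """Target-agnostic: map each category to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def target_encode(values, targets, smoothing=1.0):
    """Target-based: replace each category with a smoothed mean of a binary target.

    The additive smoothing toward the global mean is an illustrative choice,
    not necessarily the variant evaluated in the study.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, [mapping[v] for v in values]


# Hypothetical toy data: one categorical feature and a binary label.
colors = ["red", "green", "red", "blue", "green", "red"]
labels = [1, 0, 1, 0, 1, 0]

categories, one_hot = one_hot_encode(colors)
mapping, encoded = target_encode(colors, labels, smoothing=1.0)
```

One-hot encoding yields one column per category and never looks at the label, whereas the target-based variant yields a single numeric column derived from the label. For that reason a target-based encoder is normally fitted on training data only, so that label information does not leak into the evaluation split.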