Boosting Distributional Copula Regression for Bivariate Binary, Discrete and Mixed Responses
Abstract: Motivated by challenges in the analysis of biomedical data and observational studies, we develop statistical boosting for the general class of bivariate distributional copula regression with arbitrary marginal distributions, which is suited to modelling binary, count, continuous or mixed outcomes. In our framework, the joint distribution of arbitrary bivariate responses is modelled through a parametric copula. To arrive at a model for the entire conditional distribution, not only the marginal distribution parameters but also the copula parameters are related to covariates through additive predictors. We suggest efficient and scalable estimation by means of an adapted component-wise gradient boosting algorithm with statistical models as base-learners. A key benefit of boosting over classical likelihood or Bayesian estimation is that variable selection and shrinkage are implicit and data-driven, requiring no additional input or assumptions from the analyst. To the best of our knowledge, our implementation is the only one that combines a wide range of covariate effects, marginal distributions, copula functions, and implicit data-driven variable selection. We showcase the versatility of our approach on data from genetic epidemiology, healthcare utilization and childhood undernutrition. Our developments are implemented in the R package gamboostLSS, fostering transparent and reproducible research.
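To illustrate why component-wise gradient boosting yields implicit variable selection, the following is a minimal sketch in plain Python. It is not the paper's algorithm and not the gamboostLSS API: it assumes a single continuous response, squared-error loss, and one simple linear base-learner per covariate, whereas the paper boosts the parameters of a copula likelihood. The names `fit_simple_linear`, `componentwise_boost`, `n_iter`, and `nu` are hypothetical and chosen for illustration only.

```python
import random


def fit_simple_linear(x, r):
    """Least-squares fit of r on one covariate x with an intercept.

    Returns (intercept, slope, residual sum of squares).
    """
    n = len(x)
    mx = sum(x) / n
    mr = sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    slope = sxr / sxx if sxx > 0 else 0.0
    intercept = mr - slope * mx
    rss = sum((ri - intercept - slope * xi) ** 2 for xi, ri in zip(x, r))
    return intercept, slope, rss


def componentwise_boost(X, y, n_iter=30, nu=0.1):
    """Component-wise L2 boosting (illustrative assumption, see lead-in).

    X is a list of covariate columns, y the response.
    Returns (intercept, coefficients); covariates that are never
    selected keep a coefficient of exactly 0.0.
    """
    n, p = len(y), len(X)
    intercept = sum(y) / n          # offset: mean of the response
    coefs = [0.0] * p
    fitted = [intercept] * n
    for _ in range(n_iter):
        # Negative gradient of the squared-error loss is the residual.
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        # Fit every base-learner separately to the negative gradient.
        fits = [fit_simple_linear(col, resid) for col in X]
        # Update only the best-fitting component, shrunk by step length nu.
        best = min(range(p), key=lambda j: fits[j][2])
        b0, b1, _ = fits[best]
        intercept += nu * b0
        coefs[best] += nu * b1
        fitted = [fi + nu * (b0 + b1 * xi) for fi, xi in zip(fitted, X[best])]
    return intercept, coefs


# Simulated data: only the first two of five covariates carry signal.
random.seed(1)
n_obs, n_cov = 200, 5
X_demo = [[random.uniform(-1, 1) for _ in range(n_obs)] for _ in range(n_cov)]
y_demo = [2 * X_demo[0][i] - X_demo[1][i] + random.gauss(0, 0.5)
          for i in range(n_obs)]
b0_demo, coefs_demo = componentwise_boost(X_demo, y_demo)
print([round(c, 2) for c in coefs_demo])
```

Because each iteration updates only one component, stopping early leaves uninformative covariates at or near zero, while the informative coefficients are shrunk towards zero relative to an unpenalised fit. In this sketch the number of iterations plays the role of the main tuning parameter; in practice it would typically be chosen by cross-validation or a similar resampling scheme.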