Hurdle-Shifted Negative Binomial Model
- Hurdle-Shifted Negative Binomial is a two-component model that separates the zero hurdle from the positive count process to address overdispersion.
- It combines a logistic hurdle for zero generation with a zero-truncated negative binomial for strictly positive outcomes, enhancing interpretability.
- The model integrates covariate effects and supports Bayesian clustering, making it well suited to applications such as microbiome and single-cell RNA sequencing data.
The hurdle-shifted negative binomial (HNB) model is a two-component framework for modeling count data with an excess of zeros (zero-inflation) or, more generally, for data exhibiting deviations from standard count distributions near zero. It decouples the mechanism generating zeros from the process governing positive counts, combining a logistic hurdle at zero with a zero-truncated negative binomial (NB) model for strictly positive outcomes. The model is widely applied to domains such as microbiome data, single-cell RNA sequencing, and messaging data—contexts where structural zeros and overdispersion are prominent features. Its flexibility extends to both classical regression and Bayesian clustering, with capacity for covariate incorporation and multivariate extension (Franzolini et al., 2022, Beveridge et al., 2024).
1. Model Structure and Probability Mass Function
The HNB model posits that observed counts arise via two mechanisms:
- With hurdle probability $\pi$ (denoted $p$ in regression parameterizations), the outcome is positive ($Y \geq 1$); with probability $1 - \pi$, it is zero.
- Conditioned on crossing the hurdle, strictly positive counts follow a shifted or zero-truncated NB distribution.
The general pmf can be expressed as

$$
P(Y = y) = (1 - \pi)\,\mathbf{1}\{y = 0\} + \pi\, f^{+}(y)\,\mathbf{1}\{y \geq 1\},
$$

with the shifted-NB component

$$
f^{+}(y) = \binom{y + r - 2}{y - 1}\, q^{r} (1 - q)^{\,y - 1}, \qquad y = 1, 2, \ldots,
$$

where $r > 0$ (size) and $q \in (0, 1)$ (success probability) (Franzolini et al., 2022).
In the regression context, the HNB is alternatively written in the mean–dispersion NB parameterization:

$$
P(Y = y) = (1 - p)\,\mathbf{1}\{y = 0\} + p\, \frac{\mathrm{NB}(y;\, \mu, \theta)}{1 - \mathrm{NB}(0;\, \mu, \theta)}\,\mathbf{1}\{y \geq 1\},
$$

where $\mathrm{NB}(y;\, \mu, \theta)$ is the standard NB pmf (Beveridge et al., 2024). The model thus separates structural zeros from sampling zeros and truncates the count model to strictly positive values.
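As a concrete illustration, the mean–dispersion hurdle pmf can be evaluated with SciPy's `nbinom`; the function name and the $(\mu, \theta)$-to-`(n, prob)` conversion below are our own sketch, not code from the cited papers:

```python
import numpy as np
from scipy.stats import nbinom

def hnb_pmf(y, p, mu, theta):
    """Hurdle NB pmf, mean-dispersion parameterization.

    p     : hurdle probability P(Y > 0)
    mu    : NB mean; theta : NB dispersion (size)
    scipy's nbinom takes (n, prob), so convert: prob = theta / (theta + mu).
    """
    y = np.asarray(y)
    prob = theta / (theta + mu)           # NB success probability
    nb0 = nbinom.pmf(0, theta, prob)      # NB mass at zero (to truncate away)
    pos = p * nbinom.pmf(y, theta, prob) / (1.0 - nb0)
    return np.where(y == 0, 1.0 - p, pos)
```

The `np.where` applies the zero branch with probability $1 - p$ and renormalizes the NB over $\{1, 2, \ldots\}$ for the positive branch, so the mass function sums to one.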
2. Multivariate and Bayesian Extensions
For $G$ independent zero-inflated processes, let $\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{iG})$ with hurdle parameters $\pi_g$ and NB parameters $(r_g, q_g)$. The joint likelihood assumes conditional independence across processes:

$$
p(\mathbf{y}_i \mid \boldsymbol{\pi}, \mathbf{r}, \mathbf{q}) = \prod_{g=1}^{G} p(y_{ig} \mid \pi_g, r_g, q_g).
$$

The total likelihood for all data is the product over subjects, $\prod_{i=1}^{n} p(\mathbf{y}_i \mid \boldsymbol{\pi}, \mathbf{r}, \mathbf{q})$. This modeling approach substantially reduces the number of parameters compared to fully general multivariate models (Franzolini et al., 2022).
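Under conditional independence the joint log-likelihood is simply a sum over processes and subjects. A minimal sketch using the shifted-NB hurdle pmf (function names and parameter shapes are our assumptions):

```python
import numpy as np
from scipy.stats import nbinom

def hnb_shifted_logpmf(y, pi, r, q):
    """log pmf of a shifted-NB hurdle: zero w.p. 1 - pi, else 1 + NB(r, q)."""
    y = np.asarray(y)
    with np.errstate(divide="ignore"):
        pos = np.log(pi) + nbinom.logpmf(y - 1, r, q)  # shifted NB branch
    return np.where(y == 0, np.log1p(-pi), pos)

def joint_loglik(Y, pi, r, q):
    """Y: (n, G) counts; pi, r, q: length-G parameter arrays.
    Conditional independence => sum of per-process log-likelihoods."""
    ll = hnb_shifted_logpmf(Y, pi[None, :], r[None, :], q[None, :])
    return ll.sum()
```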
A two-level enriched finite mixture prior is introduced for Bayesian modeling and clustering:
- Outer mixture on zero vs. positive (hurdle patterns), with $K$ components, weight vector $\boldsymbol{\omega}$, and Bernoulli hurdle parameters $\pi_k$.
- Inner mixture on positive counts, with $L_k$ components per outer atom, weights $\boldsymbol{\lambda}_k$, and cluster-specific NB parameters $(r_{kl}, q_{kl})$.
The hierarchical mixture induces nested clustering grouping subjects first by zero/positive patterns, then by similarity of their positive-count behavior (Franzolini et al., 2022).
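A generative sketch may clarify the hierarchy: draw an outer label (hurdle pattern), then an inner label (positive-count cluster), then the counts. All parameter shapes here are illustrative assumptions, not the authors' notation:

```python
import numpy as np

def sample_enriched(n, G, omega, pi, inner_w, r, q, rng):
    """Generative sketch of the two-level enriched mixture.

    omega   : (K,)   outer weights over hurdle patterns
    pi      : (K, G) hurdle probabilities per outer atom
    inner_w : (K, L) inner weights over positive-count clusters
    r, q    : (K, L) shifted-NB size / success parameters
    Returns counts Y (n, G) plus outer and inner labels.
    """
    K, L = inner_w.shape
    Y = np.zeros((n, G), dtype=int)
    outer = rng.choice(K, size=n, p=omega)
    inner = np.array([rng.choice(L, p=inner_w[k]) for k in outer])
    for i, (k, l) in enumerate(zip(outer, inner)):
        crossed = rng.random(G) < pi[k]                 # hurdle stage
        counts = 1 + rng.negative_binomial(r[k, l], q[k, l], size=G)
        Y[i] = np.where(crossed, counts, 0)             # shifted NB if crossed
    return Y, outer, inner
```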
3. Parameterization, Covariates, and Estimation
The NB count process is parameterized either by $(r, q)$ or, in regression settings, by its mean $\mu$ and dispersion $\theta$ (with $\theta$ playing the role of the size):

$$
\mu = \frac{\theta (1 - q)}{q}, \qquad \mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{\theta}.
$$
Covariate dependence is handled via GLM-style link functions:

$$
\log \mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}, \qquad \operatorname{logit}(p_i) = \mathbf{x}_i^{\top} \boldsymbol{\gamma},
$$

where $\mathbf{x}_i$ denotes the covariate vector, and $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are coefficient vectors for the count and hurdle components, respectively. In parallel multivariate settings, each outcome may have its own regression parameters (Beveridge et al., 2024).
Estimation is typically via direct maximization of the closed-form likelihood. Because the hurdle indicator and the zero-truncated count component contribute separate factors, no EM algorithm is required; block-coordinate or joint Newton–Raphson optimization suffices (Beveridge et al., 2024).
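Because the likelihood factorizes, the covariate-free case reduces to a closed-form hurdle estimate plus a numerical optimization of the zero-truncated NB part. A minimal sketch, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def fit_hnb(y):
    """MLE for a covariate-free HNB in the mean-dispersion parameterization.

    The hurdle MLE is the observed positive fraction; the truncated NB
    parameters are found numerically on the log scale for positivity.
    """
    y = np.asarray(y)
    p_hat = (y > 0).mean()                  # closed-form hurdle MLE
    pos = y[y > 0]

    def nll(params):                        # zero-truncated NB neg. log-lik
        mu, theta = np.exp(params)
        prob = theta / (theta + mu)
        logtrunc = np.log1p(-nbinom.pmf(0, theta, prob))
        return -(nbinom.logpmf(pos, theta, prob) - logtrunc).sum()

    res = minimize(nll, x0=np.log([pos.mean(), 1.0]), method="Nelder-Mead")
    mu_hat, theta_hat = np.exp(res.x)
    return p_hat, mu_hat, theta_hat
```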
4. Hierarchical Priors and Posterior Sampling
Bayesian implementations employ hierarchical priors for mixture dimensions, component weights, and distributional parameters:
- Outer mixture: a prior on the number of components $K$; Dirichlet prior on the weights; Beta prior for the hurdle probabilities.
- Inner mixtures: priors on the component counts $L_k$; Dirichlet priors on the weights; Geometric prior for the NB size; Beta prior for the NB success probabilities.
Posterior inference is performed via a tailored MCMC scheme:
- Gibbs-type updates for cluster allocations.
- Blocked updates for mixture sizes, weights, and cluster-specific parameters.
- Marginal sampling integrating out weights and latent variables, utilizing closed-form conjugate updates and numerical approximation for infinite-sum terms in NB likelihoods.
- The algorithm supports both conditional and marginal samplers with exact conditional distributions for mixture assignments, component parameters, and hyperparameters (Franzolini et al., 2022).
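To make the allocation step concrete, a single Gibbs sweep over outer (hurdle-pattern) labels can be sketched as follows; this is an illustrative fragment that omits the inner-label and parameter updates of the full sampler:

```python
import numpy as np

def sample_outer_alloc(Y, omega, pi, rng):
    """One Gibbs sweep over outer (hurdle-pattern) allocations.

    P(c_i = k | ...) is proportional to
        omega_k * prod_g pi_kg^{z_ig} (1 - pi_kg)^{1 - z_ig},
    with z_ig = 1{y_ig > 0}.  omega: (K,), pi: (K, G), Y: (n, G).
    """
    Z = (Y > 0).astype(float)
    logp = (np.log(omega)[None, :]           # prior weight of each cluster
            + Z @ np.log(pi).T               # contribution of positives
            + (1.0 - Z) @ np.log1p(-pi).T)   # contribution of zeros
    logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
    probs = np.exp(logp)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(omega), p=row) for row in probs])
```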
5. Comparative Performance and Model Selection
Empirical results indicate that the HNB model excels in settings with zero-deflation and strong covariate-driven structure:
- In standard zero-inflated data, the zero-inflated NB (ZINB) model performs best.
- When zeros are less frequent than the NB predicts (zero-deflation), HNB substantially outperforms ZINB, with improvements accentuated as deflation increases.
- In high-dimensional settings without informative covariates, latent Gaussian copula models (such as TLNPN) can better capture joint dependence, especially under strong feature correlation.
- When meaningful covariates are available, HNB more accurately models marginal and joint extremes and consistently outperforms copula models when predictors are strongly associated with outcomes (Beveridge et al., 2024).
Zero-proportion by itself exerts minimal influence on comparative fit for HNB and TLNPN; the availability and informativeness of covariates, along with the magnitude of inter-variable dependence, are the primary drivers of model selection.
6. Nested Clustering and Interpretability
The two-level clustering induced by the enriched mixture yields a nested partition of subjects:
- Outer clusters group subjects sharing similar hurdle-crossing probabilities, i.e., similar patterns of structural zero occurrence.
- Inner clusters, nested within each outer group, refine the partition by grouping subjects with similar positive-count distributions (shifted NB characteristics).
This structure enables interpretable exploration of the data: first identifying major subpopulations by their tendency for zeros, then further discriminating by intensity or dispersion of positive counts (Franzolini et al., 2022).
7. Practical Guidelines and Application Domains
Practical recommendations for leveraging the HNB model include:
- Use HNB when both zero-accumulation and positive-count intensity should be flexibly and independently modeled, particularly in the presence of covariates.
- Fit both the occurrence of zeros and the distribution of positive counts via generalized linear model frameworks, which facilitates interpretability and control for confounding (Beveridge et al., 2024).
- In Bayesian settings, employ multi-level mixture models for flexible clustering across both zero patterns and positive-count behavior (Franzolini et al., 2022).
Typical application areas include high-throughput sequencing datasets, digital communication logs, and any count data environment characterized by excess zeros and overdispersion. The explicit hurdle structure allows the HNB model to account for structural phenomena leading to irregular zero frequencies—whether excess or scarcity—across a range of modern biomedical and communication datasets.