
Hurdle-Shifted Negative Binomial Model

Updated 24 January 2026
  • The hurdle-shifted negative binomial (HNB) is a two-component model that separates the zero hurdle from the positive-count process, accommodating both irregular zero frequencies and overdispersion.
  • It combines a logistic hurdle for zero generation with a zero-truncated negative binomial for strictly positive outcomes, enhancing interpretability.
  • The model incorporates covariate effects and supports Bayesian clustering, making it well suited to applications such as microbiome and single-cell RNA sequencing data.

The hurdle-shifted negative binomial (HNB) model is a two-component framework for modeling count data with an excess of zeros (zero-inflation) or, more generally, for data exhibiting deviations from standard count distributions near zero. It decouples the mechanism generating zeros from the process governing positive counts, combining a logistic hurdle at zero with a zero-truncated negative binomial (NB) model for strictly positive outcomes. The model is widely applied to domains such as microbiome data, single-cell RNA sequencing, and messaging data, contexts where structural zeros and overdispersion are prominent features. Its flexibility extends to both classical regression and Bayesian clustering, with capacity for covariate incorporation and multivariate extension (Franzolini et al., 2022; Beveridge et al., 2024).

1. Model Structure and Probability Mass Function

The HNB model posits that observed counts $Y \in \{0, 1, 2, \dots\}$ arise via two mechanisms:

  • With hurdle probability $p \in (0,1)$, the outcome is positive ($Y \ge 1$); the regression parameterization below instead works with the zero probability $\pi_H = 1 - p$.
  • Conditioned on crossing the hurdle, strictly positive counts follow a shifted (equivalently, zero-truncated) NB.

The general pmf can be expressed as:

$$f(y \mid p, r, \theta) = \begin{cases} 1 - p, & y = 0 \\ p \, g(y \mid r, \theta), & y \ge 1 \end{cases}$$

with

$$g(y \mid r, \theta) = \frac{(y+r-2)!}{(r-1)!\,(y-1)!}\, \theta^{y-1} (1-\theta)^{r}, \quad y \ge 1,$$

where $r \in \{1, 2, \dots\}$ (size) and $\theta \in (0,1)$ (success probability) (Franzolini et al., 2022).
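A minimal numerical sketch of this pmf follows; the function names and the normalization check are illustrative, not taken from Franzolini et al. (2022):

```python
from scipy.special import comb


def shifted_nb_pmf(y, r, theta):
    """g(y | r, theta): shifted NB on y = 1, 2, ...; the factorial ratio
    equals the binomial coefficient C(y + r - 2, y - 1)."""
    if y < 1:
        return 0.0
    return comb(y + r - 2, y - 1) * theta ** (y - 1) * (1 - theta) ** r


def hnb_pmf(y, p, r, theta):
    """f(y | p, r, theta): a zero with probability 1 - p; otherwise the
    hurdle is crossed and the shifted NB governs the positive count."""
    return 1.0 - p if y == 0 else p * shifted_nb_pmf(y, r, theta)


# Sanity check: the mass over a generous support sums to ~1.
print(sum(hnb_pmf(y, p=0.7, r=3, theta=0.4) for y in range(500)))  # ~1.0
```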

In the regression context, the HNB is alternatively written in the mean–dispersion NB parameterization:

$$P(Y_i = 0) = \pi_{H,i}, \qquad P(Y_i = y) = (1 - \pi_{H,i})\, \frac{f_{\mathrm{NB}}(y; \mu_i, r)}{1 - f_{\mathrm{NB}}(0; \mu_i, r)}, \quad y \ge 1,$$

where $f_{\mathrm{NB}}$ is the standard NB pmf (Beveridge et al., 2024). The model thus separates structural zeros from sampling zeros and truncates the count model to strictly positive values.

2. Multivariate and Bayesian Extensions

For $d$ independent zero-inflated processes, let $\mathbf{Y}_i = (Y_{i1}, \dots, Y_{id})$ with hurdle and NB parameters $(\mathbf{p}_i, \mathbf{r}_i, \boldsymbol{\theta}_i)$. The joint likelihood assumes conditional independence across processes:

$$P(\mathbf{Y}_i = \mathbf{y}_i \mid \mathbf{p}_i, \mathbf{r}_i, \boldsymbol{\theta}_i) = \prod_{j=1}^{d} f(y_{ij} \mid p_{ij}, r_{ij}, \theta_{ij}).$$

The total likelihood for all data is

$$L = \prod_{i=1}^{n} \prod_{j=1}^{d} f(y_{ij} \mid p_{ij}, r_{ij}, \theta_{ij}).$$

This modeling approach substantially reduces the number of parameters compared to fully general multivariate models (Franzolini et al., 2022).
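Because the likelihood factorizes into univariate HNB terms, the joint log-likelihood is a simple sum. A self-contained sketch, with hypothetical helper names and log-scale evaluation for numerical stability:

```python
import numpy as np
from scipy.special import gammaln


def hnb_logpmf(y, p, r, theta):
    """Elementwise log f(y | p, r, theta) for the univariate HNB."""
    y = np.asarray(y)
    yp = np.maximum(y, 1)  # guard: the y == 0 branch ignores log_g anyway
    # log g(y | r, theta), with gammaln generalizing the factorials
    log_g = (gammaln(yp + r - 1) - gammaln(r) - gammaln(yp)
             + (yp - 1) * np.log(theta) + r * np.log(1 - theta))
    return np.where(y == 0, np.log(1 - p), np.log(p) + log_g)


def joint_loglik(Y, P, R, Theta):
    """Total log-likelihood: sum of log f(y_ij | p_ij, r_ij, theta_ij)
    over the n x d count matrix Y (conditional independence across j)."""
    return hnb_logpmf(Y, P, R, Theta).sum()


# Example: n = 3 subjects, d = 2 processes, scalar parameters broadcast.
Y = np.array([[0, 2], [1, 0], [4, 1]])
print(joint_loglik(Y, P=0.6, R=2.0, Theta=0.4))
```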

A two-level enriched finite mixture prior is introduced for Bayesian modeling and clustering:

  • Outer mixture on zero vs. positive (hurdle patterns), with $M$ components, weight vector $\mathbf{w}$, and Bernoulli hurdle parameters $\mathbf{p}^\star$.
  • Inner mixture on positive counts, with $S_m$ components per outer atom, weights $\mathbf{q}_m$, and cluster-specific NB parameters $(r^\star_{ms}, \theta^\star_{ms})$.

The hierarchical mixture induces a nested clustering that groups subjects first by their zero/positive patterns and then by the similarity of their positive-count behavior (Franzolini et al., 2022).
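A minimal generative sketch of this nested allocation, with hand-fixed mixture atoms (all sizes, weights, and the seed are hypothetical, not values from Franzolini et al., 2022):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-fixed enriched-mixture atoms (illustrative sizes and weights).
M = 2                                   # outer components (hurdle patterns)
w = np.array([0.7, 0.3])                # outer weights
S = [3, 2]                              # inner components per outer atom
q = [np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4])]  # inner weights


def draw_nested_labels(n):
    """Draw an (outer, inner) label pair per subject: outer indexes the
    hurdle pattern, inner the positive-count cluster nested within it."""
    outer = rng.choice(M, size=n, p=w)
    inner = np.array([rng.choice(S[m], p=q[m]) for m in outer])
    return list(zip(outer.tolist(), inner.tolist()))


print(draw_nested_labels(6))  # e.g. [(0, 1), (1, 0), (0, 0), ...]
```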

3. Parameterization, Covariates, and Estimation

The NB count process is parameterized either by $(r, \theta)$ or, in regression settings, by its mean $\mu$ and dispersion $r$:

$$f_{\mathrm{NB}}(y; \mu, r) = \frac{\Gamma(y+r)}{\Gamma(r)\, y!} \left(\frac{\mu}{\mu + r}\right)^{y} \left(\frac{r}{\mu + r}\right)^{r}, \quad y = 0, 1, 2, \dots$$
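As a quick consistency check (an illustrative sketch; the mapping to SciPy's parameterization is our observation, not from the cited papers), this pmf coincides with `scipy.stats.nbinom` under $n = r$ and $p = r/(\mu + r)$:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom


def f_nb(y, mu, r):
    """Mean-dispersion NB pmf, computed on the log scale for stability."""
    log_pmf = (gammaln(y + r) - gammaln(r) - gammaln(y + 1)
               + y * np.log(mu / (mu + r)) + r * np.log(r / (mu + r)))
    return np.exp(log_pmf)


mu, r = 3.5, 2.0
for y in range(5):
    # The two columns agree: nbinom uses n = r, p = r / (mu + r).
    print(y, f_nb(y, mu, r), nbinom.pmf(y, r, r / (mu + r)))
```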

Covariate dependence is handled via GLM-style link functions:

$$\log \mu_i = x_i^{\top} \beta, \qquad \operatorname{logit} \pi_{H,i} = x_i^{\top} \gamma,$$

where $x_i$ denotes the covariate vector, and $\beta$ and $\gamma$ are coefficient vectors for the count and hurdle components, respectively. In parallel multivariate settings, each outcome may have its own regression parameters (Beveridge et al., 2024).

Estimation is typically via direct maximization of the closed-form log-likelihood

$$\ell_H(\beta, \gamma, r) = \sum_{i=1}^{n} \Big[ I_{i,0} \ln \pi_{H,i} + I_{i,+} \big( \ln(1 - \pi_{H,i}) + \ln f_{\mathrm{NB}}(Y_i; \mu_i, r) - \ln(1 - f_{\mathrm{NB}}(0; \mu_i, r)) \big) \Big],$$

where $I_{i,0}$ and $I_{i,+}$ indicate zero and positive observations, respectively. No EM algorithm is required; block-coordinate or joint Newton–Raphson optimization suffices (Beveridge et al., 2024).
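A minimal sketch of this direct maximization on simulated data; the design, true parameter values, and the use of `scipy.optimize.minimize` with BFGS are illustrative assumptions, not the exact procedure of Beveridge et al. (2024):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(42)

# Simulated design: intercept plus one covariate (illustrative).
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, gamma_true, r_true = np.array([0.5, 0.8]), np.array([-1.0, 0.6]), 2.0
mu, pi_h = np.exp(X @ beta_true), expit(X @ gamma_true)

# Hurdle draw, then zero-truncated NB by rejection for hurdle-crossers.
y = np.zeros(n, dtype=int)
for i in np.where(rng.random(n) > pi_h)[0]:
    while y[i] == 0:
        y[i] = rng.negative_binomial(r_true, r_true / (mu[i] + r_true))


def neg_loglik(params):
    """-ell_H(beta, gamma, r): logistic hurdle plus zero-truncated NB."""
    beta, gamma, r = params[:2], params[2:4], np.exp(params[4])
    mu_i, pi_i = np.exp(X @ beta), expit(X @ gamma)
    log_nb = (gammaln(y + r) - gammaln(r) - gammaln(y + 1)
              + y * np.log(mu_i / (mu_i + r)) + r * np.log(r / (mu_i + r)))
    log_nb0 = r * np.log(r / (mu_i + r))      # log f_NB(0; mu_i, r)
    ll = np.where(y == 0,
                  np.log(pi_i),
                  np.log1p(-pi_i) + log_nb - np.log1p(-np.exp(log_nb0)))
    return -ll.sum()


fit = minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
print(fit.x)  # approx. [0.5, 0.8, -1.0, 0.6, log 2]
```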

4. Hierarchical Priors and Posterior Sampling

Bayesian implementations employ hierarchical priors for mixture dimensions, component weights, and distributional parameters:

  • Outer mixture size $M \sim \mathrm{Poi}_0(\Lambda_M)$; Dirichlet weights; Beta prior for the hurdle probabilities.
  • Inner mixture sizes $S_m \sim \mathrm{Poi}_0(\Lambda_S)$; Dirichlet weights; Geometric prior for the NB size; Beta prior for the NB success probabilities (a prior-draw sketch follows this list).
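The sketch below draws the quantities that the allocation sketch in Section 2 fixed by hand; the hyperparameter values ($\Lambda_M = \Lambda_S = 2$, Beta(1,1), Geometric(0.5)) are illustrative placeholders, not the papers' choices:

```python
import numpy as np

rng = np.random.default_rng(1)


def poi0(lam):
    """Zero-truncated Poisson Poi_0(lam), drawn by simple rejection."""
    while True:
        x = rng.poisson(lam)
        if x > 0:
            return x


# Outer level: mixture size, Dirichlet weights, Beta hurdle probabilities.
M = poi0(2.0)                                # M ~ Poi_0(Lambda_M)
w = rng.dirichlet(np.ones(M))
p_star = rng.beta(1.0, 1.0, size=M)

# Inner level, per outer atom: sizes, weights, and NB parameters.
S = [poi0(2.0) for _ in range(M)]            # S_m ~ Poi_0(Lambda_S)
q = [rng.dirichlet(np.ones(s)) for s in S]
r_star = [rng.geometric(0.5, size=s) for s in S]      # NB sizes in {1, 2, ...}
theta_star = [rng.beta(1.0, 1.0, size=s) for s in S]  # NB success probabilities
```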

Posterior inference is performed via a tailored MCMC scheme:

  • Gibbs-type updates for cluster allocations.
  • Blocked updates for mixture sizes, weights, and cluster-specific parameters.
  • Marginal sampling integrating out weights and latent variables, utilizing closed-form conjugate updates and numerical approximation for infinite-sum terms in NB likelihoods.
  • The algorithm supports both conditional and marginal samplers with exact conditional distributions for mixture assignments, component parameters, and hyperparameters (Franzolini et al., 2022).

5. Comparative Performance and Model Selection

Empirical results indicate that the HNB model excels in settings with zero-deflation and strong covariate-driven structure:

  • In standard zero-inflated data, the zero-inflated NB (ZINB) model performs best.
  • When zeros are less frequent than the NB predicts (zero-deflation), HNB substantially outperforms ZINB, with improvements accentuated as deflation increases.
  • In high-dimensional settings without informative covariates, latent Gaussian copula models (such as TLNPN) can better capture joint dependence, especially under strong feature correlation.
  • When meaningful covariates are available, HNB more accurately models marginal and joint extremes and consistently outperforms copula models when predictors are strongly associated with outcomes (Beveridge et al., 2024).

The zero proportion by itself exerts minimal influence on the comparative fit of HNB and TLNPN; the availability and informativeness of covariates, along with the magnitude of inter-variable dependence, are the primary drivers of model selection.

6. Nested Clustering and Interpretability

The two-level clustering induced by the enriched mixture yields a nested partition of subjects:

  • Outer clusters group subjects sharing similar hurdle-crossing probabilities, i.e., similar patterns of structural zero occurrence.
  • Inner clusters, nested within each outer group, refine the partition by grouping subjects with similar positive-count distributions (shifted NB characteristics).

This structure enables interpretable exploration of the data: first identifying major subpopulations by their tendency for zeros, then further discriminating by intensity or dispersion of positive counts (Franzolini et al., 2022).

7. Practical Guidelines and Application Domains

Practical recommendations for leveraging the HNB model include:

  • Use HNB when both zero-accumulation and positive-count intensity should be flexibly and independently modeled, particularly in the presence of covariates.
  • Fit both the occurrence of zeros and the distribution of positive counts via generalized linear model frameworks, which facilitates interpretability and control for confounding (Beveridge et al., 2024).
  • In Bayesian settings, employ multi-level mixture models for flexible clustering across both zero patterns and positive-count behavior (Franzolini et al., 2022).

Typical application areas include high-throughput sequencing datasets, digital communication logs, and any count data environment characterized by excess zeros and overdispersion. The explicit hurdle structure allows the HNB model to account for structural phenomena leading to irregular zero frequencies—whether excess or scarcity—across a range of modern biomedical and communication datasets.

