CountHalluSet: Benchmarking Counting Hallucinations
- CountHalluSet is a suite of datasets and protocols that systematically quantifies counting hallucinations in diffusion models using strict counting criteria and pre-trained models.
- Its standardized evaluation protocol benchmarks generative performance across varied sampling regimes, solver orders, and noise initializations to uncover errors undetected by conventional FID.
- The framework promotes integrating factual consistency audits into generative model development, addressing risks in high-stakes applications such as medical imaging and scientific illustration.
CountHalluSet is a suite of datasets and protocols designed for the systematic quantification of counting hallucinations in diffusion probabilistic models (DPMs). Counting hallucinations refer to generative failures where the synthesized content contains an incorrect number of instances—such as generating a hand with six fingers—despite the training distribution prohibiting such outcomes. CountHalluSet establishes well-defined benchmarks supported by pre-trained counting models and rigorously specified criteria, enabling quantitative evaluation of factual consistency errors that are not reliably detected by standard perceptual metrics like FID (Fu et al., 15 Oct 2025).
1. Dataset Composition and Counting Criteria
CountHalluSet comprises three complementary datasets, each constructed to demand precise object counting and restrict category-wise instance numbers:
- ToyShape: 30,000 images containing white triangles, squares, and pentagons (each of area 120 px). The counting criterion enforces at most one instance per shape type, with at least one shape per image ($n_c \le 1$ for all categories $c$; $\sum_c n_c \ge 1$).
- SimObject: 30,000 images featuring photorealistic renders of mugs, apples, and clocks (each with 10 3D variants) on an uncluttered tabletop. The same counting constraints as ToyShape apply.
- RealHand: 5,050 images, each displaying five human fingers in canonical dorsal or palmar views under varied backgrounds and lighting (counting criterion $n_{\mathrm{finger}} = 5$).
Reference sets strictly comply with the imposed counting standards. Pre-trained counting models enable automated quantification: ResNet-50 (ToyShape and SimObject, 99.9% counting accuracy); MaxViT and YOLO-12 (RealHand, 99% fingertip detection accuracy and 96% agreement with human judgment on counting-readiness) (Fu et al., 15 Oct 2025).
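The per-dataset counting criteria above can be expressed as simple predicates. A minimal sketch follows; the function and category names are illustrative, not taken from the released toolkit:

```python
def satisfies_toyshape_criterion(counts):
    """ToyShape/SimObject criterion: at most one instance per shape
    category, and at least one object in the image overall."""
    return all(n <= 1 for n in counts.values()) and sum(counts.values()) >= 1


def satisfies_realhand_criterion(finger_count):
    """RealHand criterion: exactly five fingers must be counted."""
    return finger_count == 5
```

In practice the `counts` dictionary would be produced by the pre-trained counting model for each generated image.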
2. Formalism for Counting Hallucinations
A counting hallucination is formally defined for a generated sample $x$ as follows. For each category $c$, let $n_c(x)$ denote the count predicted by the counting model. A counting-ready indicator $r(x) \in \{0, 1\}$ filters out uncountable samples (blurred, deformed, etc.). Define the per-sample hallucination indicator

$$h(x) = \mathbb{1}\!\left[\, r(x) = 1 \;\wedge\; \exists\, c : n_c(x) \text{ violates the counting criterion} \,\right].$$

For a batch of $N$ generated samples $\{x_i\}_{i=1}^{N}$, the counting-hallucination rate (CHR) aggregates the proportion exhibiting counting violations:

$$\mathrm{CHR} = \frac{1}{N} \sum_{i=1}^{N} h(x_i).$$
In RealHand, the non-counting failure rate (NCFR) quantifies the fraction of samples discarded as not counting-ready, and the total failure rate (TFR) is the sum of CHR and NCFR. This formalism enables fine-grained isolation of errors that violate training-distribution counts, regardless of perceptual quality (Fu et al., 15 Oct 2025).
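The CHR/NCFR/TFR aggregation described above can be sketched as follows; all rates are taken as fractions of the full batch so that TFR = CHR + NCFR, and the identifier names are illustrative:

```python
def hallucination_metrics(samples, criterion):
    """Aggregate CHR, NCFR, and TFR over a batch.

    samples:   list of (counts, counting_ready) pairs, where `counts`
               maps each category to its predicted count.
    criterion: predicate on `counts` returning True when the counts
               satisfy the dataset's counting standard.
    """
    n = len(samples)
    not_ready = sum(1 for _, ready in samples if not ready)
    violations = sum(1 for counts, ready in samples
                     if ready and not criterion(counts))
    chr_rate = violations / n   # counting-hallucination rate
    ncfr = not_ready / n        # non-counting failure rate
    return chr_rate, ncfr, chr_rate + ncfr
```

Note that samples filtered out as not counting-ready contribute to NCFR but are excluded from the counting-violation tally, so the two failure modes are disjoint.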
3. Standardized Evaluation Protocol
Diffusion models are evaluated via a standardized protocol utilizing CountHalluSet:
- Modeling: A DDPM is trained from scratch for ToyShape; an LDM pre-trained on CelebA-HQ is fine-tuned for SimObject and RealHand.
- Sampling Regimes: Models are sampled under varied hyperparameters:
- Solver type/order: DDPM (ancestral, 1,000 steps), DPM-Solver-1 (first-order ODE, DDIM equivalent), DPM-Solver-2 (second-order ODE).
- Sampling steps for ODE solvers: $25, 50, 100$.
- Initial noise: “normal” ($x_T \sim \mathcal{N}(0, I)$) vs. “diffused” (drawn from the true terminal distribution $q(x_T)$).
- Quantification: For each configuration, batches of samples are generated over three random seeds. The counting pipeline applies the pre-trained models (direct counting for ToyShape/SimObject; the counting-readiness indicator (CRI) followed by YOLO-12 for RealHand). Metrics (CHR, NCFR, TFR) are computed following the definitions above (Fu et al., 15 Oct 2025).
- Best Practices: Systematic isolation of hyperparameter impacts on CHR is achieved by varying one parameter at a time. Researchers are encouraged to report CHR metrics parallel to standard scores (e.g. FID) to expose factual consistency failures not encoded by perceptual metrics.
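The one-factor-at-a-time sweep described above can be sketched as a small evaluation grid; `generate` and `count_chr` are placeholders standing in for a real diffusion sampler and counting pipeline, and the solver/step/noise values mirror the protocol:

```python
import itertools
import statistics

SOLVERS = ["dpm_solver_1", "dpm_solver_2"]
STEPS = [25, 50, 100]
NOISE_INITS = ["normal", "diffused"]
SEEDS = [0, 1, 2]


def evaluate_grid(generate, count_chr):
    """Run every (solver, steps, noise) configuration over all seeds
    and report the mean and standard deviation of CHR per configuration."""
    results = {}
    for solver, steps, noise in itertools.product(SOLVERS, STEPS, NOISE_INITS):
        chrs = [count_chr(generate(solver, steps, noise, seed))
                for seed in SEEDS]
        results[(solver, steps, noise)] = (statistics.mean(chrs),
                                           statistics.stdev(chrs))
    return results
```

Reporting a per-configuration mean and spread over seeds, rather than a single run, separates systematic hyperparameter effects from sampling noise.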
4. Key Empirical Findings
CountHalluSet enables identification of factors influencing counting hallucinations in generative models:
- Solver Granularity: On ToyShape/SimObject, increasing the number of ODE steps from 25 to 100 reduces CHR (e.g., ToyShape CHR under diffused noise drops from 2.43% to 1.56% with DPM-Solver-1); on RealHand, CHR rises with step count, apart from occasional reversals (notably DPM-Solver-2 at 100 steps).
- Solver Order: Second-order ODE solvers reduce NCFR but increase CHR on RealHand, while still lowering overall TFR.
- Ancestral Sampling: DDPM (1,000 steps) achieves the lowest CHR and TFR in all conditions, representing a practical lower bound for counting errors.
- Noise Initialization: “Diffused” noise consistently lowers counting and total failure rates compared with “normal” initialization, underscoring the importance of informed starting points.
- Object Complexity: CHR escalates with object morphological complexity: ToyShape < SimObject < RealHand.
- Correlation with FID: On SimObject, higher FID correlates positively with CHR (positive Pearson and Spearman coefficients). On RealHand, when restricting to ODE samplers, FID correlates negatively with CHR; including ancestral sampling breaks these trends, reducing the correlations to insignificance. TFR and NCFR align robustly and positively with FID on RealHand, indicating that conventional perceptual metrics capture global image failures but poorly reflect factual consistency at the counting level (Fu et al., 15 Oct 2025).
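A correlation check of this kind can be sketched with a hand-rolled Pearson coefficient; in a real audit the two series would hold measured FID and CHR values per sampler configuration, not the toy inputs used for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)
```

A Spearman coefficient is the same computation applied to the ranks of the two series, which is what makes it robust to the monotone-but-nonlinear relationships that can arise when mixing sampler families.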
5. Protocols for Integration and Benchmarking
CountHalluSet provides researchers with:
- Dataset resources: Strictly verified benchmarks for counting hallucinations.
- Pre-trained models: ResNet-50 (ToyShape/SimObject) and MaxViT/YOLO-12 (RealHand), plus counting-ready indicators with high human rater agreement.
- Evaluation scripts: Standardized pipelines for counting model application and metric aggregation.
- Comparison framework: Direct baseline results spanning solver type, order, step count, and noise settings to facilitate model comparisons under uniform factual criteria (Fu et al., 15 Oct 2025).
For integration with new models, practitioners should ensure sampling parity with the reference set and run multiple seeds per configuration. The counting criteria and protocols must be followed exactly for factual error quantification to remain comparable.
6. Implications and Significance
CountHalluSet demonstrates that perceptual metrics such as FID alone do not reliably diagnose counting hallucinations or factual consistency violations in generative models. It exposes the risk of relying on appearance-based scores in contexts where the enumeration of objects or structural features is critical—such as medical imaging or scientific illustration. A plausible implication is that future generative model development must integrate explicit factual consistency audits alongside conventional quality assessment protocols. CountHalluSet provides foundational resources for such efforts, facilitating rigorously controlled experimentation and methodology advancement in quantifying and reducing generative model hallucinations (Fu et al., 15 Oct 2025).