Counting Hallucination in Diffusion Models
- Counting hallucination is defined as generative models producing images with incorrect object counts that violate predefined physical or logical constraints.
- The CountHalluSet benchmark uses diverse datasets and a multi-stage evaluation protocol to systematically quantify counting errors in synthetic and real-world images.
- Mitigation strategies, including joint-diffusion models and refined sampling techniques, aim to reduce numerical inaccuracies and improve semantic fidelity in output images.
Counting hallucination refers to the phenomenon in generative models, particularly vision generative models and vision-language models, where the generated output contains an incorrect number of instances of a given object or structure, violating established physical or logical constraints. This phenomenon is especially problematic in diffusion probabilistic models (DPMs) for image and video synthesis, where outputs with implausible object counts, such as a hand with six fingers or duplicate objects not present in the data, signal a clear disconnect from real-world distributions and factual priors. Systematic quantification of counting hallucinations is essential for understanding, evaluating, and advancing the reliability of generative models under factual and semantic constraints (Fu et al., 15 Oct 2025).
1. Defining Counting Hallucination in Generative Models
In diffusion models, counting hallucination is defined as the generation of an image containing an incorrect number of objects relative to a predefined criterion. Formally, for each generated sample $x$, object category $c$, and reference set of allowed counts $\mathcal{K}_c$, a diagnosis of counting hallucination is made if the predicted number of objects $\hat{n}_c(x)$ falls outside $\mathcal{K}_c$, or if the image is empty when at least one object is expected:

$$H(x) = \mathbb{1}\left[\, r(x) = 1 \;\wedge\; \left( \exists c:\ \hat{n}_c(x) \notin \mathcal{K}_c \;\vee\; \sum_{c} \hat{n}_c(x) = 0 \right) \right],$$

where $r(x)$ is the "counting-ready" indicator, i.e., it marks whether the image is of sufficient quality and clarity to be countable (Fu et al., 15 Oct 2025). In the RealHand dataset, for example, only images with exactly five fingers are deemed non-hallucinatory.
Counting hallucinations manifest when the model outputs images that, despite plausible appearance, contain numerical inaccuracies in object count (e.g., two triangles in a ToyShape scenario meant to include at most one of each geometry, or six fingers on a hand). This form of hallucination is prevalent even in state-of-the-art models and is distinct from general perceptual failures such as blurriness or severe structural deformation.
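The per-image hallucination decision described above can be sketched in a few lines of Python; the function and argument names here are illustrative stand-ins, not the paper's implementation:

```python
# Hypothetical sketch of the counting-hallucination indicator H(x); names
# and signatures are assumptions for illustration, not from the paper.

def is_hallucination(counts, allowed, counting_ready=True):
    """Flag an image as a counting hallucination.

    counts:  dict mapping object category -> predicted count n_c(x)
    allowed: dict mapping object category -> set of allowed counts K_c
    counting_ready: whether the image is clear enough to count (r(x))
    """
    if not counting_ready:
        return False  # uncountable images are excluded, not hallucinated
    # Empty image when at least one object is expected.
    if sum(counts.values()) == 0 and any(0 not in k for k in allowed.values()):
        return True
    # Any per-category count outside its allowed set.
    return any(counts.get(c, 0) not in k for c, k in allowed.items())

# RealHand-style rule: exactly five fingers are allowed.
print(is_hallucination({"finger": 6}, {"finger": {5}}))  # six fingers -> True
print(is_hallucination({"finger": 5}, {"finger": {5}}))  # -> False
```

Note that an image failing the counting-ready check is simply excluded from the evaluation rather than counted as a hallucination, mirroring the filtering step in the benchmark protocol.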
2. Datasets and Benchmarking: CountHalluSet
To systematically evaluate counting hallucinations, a specialized benchmark suite called CountHalluSet is introduced (Fu et al., 15 Oct 2025). This suite contains three datasets:
- ToyShape: 30,000 images of basic geometric shapes (triangle, square, pentagon), each with a constraint that every shape appears at most once and at least one is present per image.
- SimObject: 30,000 photorealistically rendered images (mug, apple, clock) from Unreal Engine 5, with the same counting rules as ToyShape.
- RealHand: About 5,050 real-world hand images, each expected to contain exactly five fingers. A counting-ready classifier, calibrated by human annotation, is employed to filter out images where reliable counting is not feasible.
Each dataset applies stringent criteria for annotation:
- For a generated image, the detection process first ascertains count-readiness via classification.
- Subsequently, a counting model (typically based on YOLO for real objects or simple classifiers for shapes) predicts the object count.
- Labeling proceeds by comparing the predicted count $\hat{n}_c(x)$ to the allowed set $\mathcal{K}_c$ for each object category $c$, marking $H(x)$ as true when a deviation occurs.
These protocols ensure the evaluation isolates counting hallucinations from unrelated generative artifacts.
3. Standardized Evaluation Protocol
The proposed evaluation pipeline leverages both automated detection and quality control mechanisms:
- Counting-Ready Filter: Each output is first passed through a classifier that discriminates countable images from those suffering from severe artifacts. This filtering is almost perfect for synthetic datasets but critical for real-world data like RealHand.
- Object Counting: For each count-ready image, an object detection model tallies the number of relevant objects (fingers, shapes, etc.).
- Hallucination Quantification: Images that exceed or fall short of the predefined counts, or are empty when non-emptiness is required, are flagged as counting hallucinations using the binary indicator given above.
By generating an evaluation set of the same size as the training data and applying this protocol, the hallucination rate is quantified as the proportion of counting-ready outputs for which $H(x) = 1$.
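The three-stage protocol (counting-ready filter, object counting, flagging) aggregates into a single rate; a minimal sketch, with the predicate names as assumptions for illustration:

```python
# Illustrative sketch of the evaluation protocol: filter, count, flag,
# then aggregate into a hallucination rate. Not the paper's API.

def hallucination_rate(images, counting_ready, count_objects, allowed):
    """Fraction of counting-ready images flagged as counting hallucinations."""
    flagged, ready = 0, 0
    for img in images:
        if not counting_ready(img):   # stage 1: discard uncountable outputs
            continue
        ready += 1
        counts = count_objects(img)   # stage 2: per-category object counts
        # stage 3: flag if any count violates its allowed set, or the image
        # is empty when non-emptiness is required
        empty = sum(counts.values()) == 0 and any(
            0 not in k for k in allowed.values()
        )
        if empty or any(counts.get(c, 0) not in k for c, k in allowed.items()):
            flagged += 1
    return flagged / ready if ready else 0.0

# Toy usage with precomputed counts standing in for a detector.
samples = [{"triangle": 1}, {"triangle": 2}, {"triangle": 0}]
rate = hallucination_rate(
    samples,
    counting_ready=lambda s: True,
    count_objects=lambda s: s,
    allowed={"triangle": {1}},  # ToyShape-style rule: exactly one triangle
)
print(rate)  # 2 of 3 counting-ready images hallucinate
```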
4. Influence of Diffusion Sampling Parameters
The incidence of counting hallucinations is shown to be sensitive to several critical parameters of the diffusion process:
- Solver Type: Ancestral sampling (DDPM), which uses a large number of small denoising steps (e.g., 1,000), consistently produces fewer counting hallucinations than ODE-based solvers (DPM-Solvers, DDIM-equivalents), even at equivalent or higher step counts.
- Solver Order and Step Count: In synthetic domains, increasing steps for first-order solvers usually decreases counting hallucinations, while for more complex data (RealHand), the relationship can reverse, with more steps sometimes amplifying hallucination rates unless higher-order solvers are used.
- Initial Noise Distribution: Using "diffused" noise distributions—closer to those seen during model training—rather than standard Gaussian noise, reduces both the hallucination rate and the incidence of uncountable images, indicating the importance of distributional initialization.
The precise response of hallucination metrics to these parameters is data-dependent, evidencing nuanced interactions between sampling mechanics and high-level semantic consistency constraints.
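The "diffused" noise initialization can be illustrated with the closed-form forward process $q(x_T \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0,\ (1 - \bar{\alpha}_T) I)$: rather than drawing $x_T$ from a standard Gaussian, a data sample is diffused forward to step $T$. The schedule value and shapes below are illustrative assumptions, not the paper's configuration:

```python
# Toy comparison of standard Gaussian vs. "diffused" initial noise.
import numpy as np

rng = np.random.default_rng(0)
abar_T = 1e-4                    # assumed cumulative alpha at the final step
x0 = rng.standard_normal(64)     # stand-in for a (flattened) training image

# Standard initialization: pure Gaussian noise, independent of the data.
x_T_gauss = rng.standard_normal(64)

# Diffused initialization: forward-diffuse a data sample to step T via
# x_T = sqrt(abar_T) * x0 + sqrt(1 - abar_T) * eps, which keeps x_T closer
# to the marginal the model actually saw during training.
eps = rng.standard_normal(64)
x_T_diffused = np.sqrt(abar_T) * x0 + np.sqrt(1.0 - abar_T) * eps

print(float(x_T_diffused.std()))  # close to unit variance by construction
```

At a well-trained schedule the two initializations are nearly indistinguishable in distribution, but the residual data-dependent component of the diffused draw is what the paper credits for lower hallucination and uncountable-image rates.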
5. Relationship to Existing Metrics and Limitations
A key finding is that prevalent perceptual quality metrics such as the Fréchet Inception Distance (FID) are insufficient proxies for counting hallucinations. Correlation analysis reveals that, depending on dataset and sampling regime:
- For SimObject, lower FID tentatively indicates fewer hallucinations (a positive Pearson correlation between FID and hallucination rate).
- For RealHand, the correlation can reverse (a negative Pearson correlation once DDPM results are excluded), demonstrating that two images may score similarly on FID yet differ drastically in factual correctness (e.g., the correct number of fingers).
This disconnect highlights that FID, being a global perceptual similarity measure, fails to penalize object count errors that are semantically critical.
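The correlation analysis itself is straightforward: compute the Pearson coefficient between per-configuration FID scores and hallucination rates. The values below are made-up placeholders, not results from the paper:

```python
# Sketch of the FID-vs-hallucination correlation analysis. The arrays are
# hypothetical per-sampler-configuration measurements, not reported data.
import numpy as np

fid = np.array([12.1, 15.4, 18.9, 22.3])            # placeholder FID scores
halluc_rate = np.array([0.08, 0.12, 0.11, 0.21])    # placeholder rates

r = np.corrcoef(fid, halluc_rate)[0, 1]
print(f"Pearson r = {r:.2f}")  # here positive: lower FID tracks fewer hallucinations
```

A near-zero or sign-flipping $r$ across datasets, as reported for RealHand, is exactly the evidence that FID alone cannot serve as a proxy for counting fidelity.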
6. Mitigation Strategies and Future Research Directions
To reduce counting hallucinations, the paper introduces a "joint-diffusion model" (JDM) paradigm that incorporates explicit structural constraints. For RealHand, this is operationalized by concatenating a hand segmentation mask to the input, directly guiding the model toward correct object structure and count. JDMs outperform standard DPMs both in terms of counting hallucination rate and general robustness to non-counting failures.
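The conditioning mechanism is simple channel-wise concatenation: the segmentation mask rides along as an extra input channel, so the denoiser's first layer sees structure explicitly. A minimal sketch, with shapes and names as illustrative assumptions:

```python
# Sketch of the joint-diffusion conditioning for RealHand: concatenate a
# hand segmentation mask to the noisy image along the channel axis.
# Shapes and variable names are illustrative, not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)
noisy_image = rng.random((3, 64, 64))        # RGB image at some diffusion step
hand_mask = rng.random((1, 64, 64)) > 0.5    # binary segmentation mask

# The mask becomes a fourth input channel; the denoiser's first convolution
# simply accepts 4 channels instead of 3, guiding structure and count.
model_input = np.concatenate(
    [noisy_image, hand_mask.astype(np.float32)], axis=0
)
print(model_input.shape)  # (4, 64, 64)
```

Because the mask fixes the hand's topology before denoising begins, the model no longer has to infer finger count from noise alone, which is the intuition behind the JDM's lower hallucination rates.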
The study outlines several future directions:
- Development of ODE solvers and sampling heuristics that can balance perceptual fidelity with strict factual constraints.
- Integration of loss functions or auxiliary supervision signals that explicitly reward correct object count during training.
- Extension of quantification protocols to other forms of factual hallucination, such as spatial or relational errors.
- Exploitation of structural priors (e.g., segmentation, masks) in sampling or inference to better align generation with real-world semantics.
This multifaceted approach is necessary because high-level semantic accuracy (e.g., object count) and low-level visual quality are not always aligned and must be jointly optimized for reliable generative inference.
Counting hallucination in diffusion models thus refers to the model’s failure to generate the correct number of object instances, and its quantification requires categorical, protocol-driven evaluation beyond traditional perceptual metrics. Study of this phenomenon reveals important limitations of current model architectures and evaluation standards, motivating new frameworks for both benchmarking and training under factual constraints (Fu et al., 15 Oct 2025).