WhatCounts: Defining Counting in AI & Systems
- WhatCounts is a collection of frameworks, benchmarks, and model architectures that define and evaluate counting accuracy across diverse domains such as NLP, computer vision, and distributed systems.
- It investigates how semantic invariance and explicit target-specification impact model performance, employing techniques like redundant counting, density regression, and exemplar fusion.
- Practical insights include improved model robustness through open-world prompts, refined accuracy metrics, and distributed counter protocols that ensure reliable event aggregation.
WhatCounts refers to frameworks, benchmarks, and model architectures that address the fundamental question of "what is to be counted" in algorithmic, machine learning, computer vision, sensor-based, NLP, and distributed systems contexts. Recent research introduces distinct technical definitions of "what counts" according to application domain: the specification of target entities (objects, actions, patterns, or semantic classes) subject to counting, how model outputs are conditioned on this specification, and the sensitivity of a system's enumeration accuracy to the semantic or structural content of the inputs.
1. Algorithmic Semantics and the WhatCounts Benchmark
WhatCounts as formalized in the "Semantic Content Determines Algorithmic Performance" study provides an atomic, controlled benchmark to evaluate semantic invariance in algorithmic procedures, specifically in LLMs (Ríos-García et al., 29 Jan 2026). The central principle is that an ideal counting system should be invariant to the semantic class of the items being counted: counting a list of cities should be equivalent, in algorithmic behavior and accuracy, to counting chemicals, names, phone numbers, addresses, or symbols when list structure, prompt, and delimiter are held constant.
The benchmark constructs delimited lists (pipe " | " separated), each uniquely populated by items from a single semantic class. The primary evaluation metric is the semantic gap $\Delta_{\mathrm{sem}}(m) = \max_{c \in C} \mathrm{Acc}(m, c) - \min_{c \in C} \mathrm{Acc}(m, c)$, where $\mathrm{Acc}(m, c)$ is the exact-count accuracy of model $m$ on semantic class $c$, and $C$ is the set of tested classes. Empirical results show that state-of-the-art LLMs such as OpenAI o3, Claude-4, DeepSeek-v3, and Kimi-k2 exhibit semantic gaps as large as 0.40: exact-count accuracy varies by more than 40 percentage points across domains, despite constant structure and format. Ablations confirm the effect persists under explicit separator definition, XML-wrapped markup, token shuffling, and in agentic tool use settings; it increases with chain-of-thought prompting and is fragile under unrelated fine-tuning. The implication is that LLMs do not implement semantically agnostic operators but rather approximate them with latent semantic sensitivity, undermining the assumption that model subroutines serve as reliable algorithmic primitives (Ríos-García et al., 29 Jan 2026).
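The semantic-gap metric is simple to compute once per-class accuracies are measured; the sketch below uses illustrative placeholder accuracies, not figures from the paper:

```python
def semantic_gap(acc_by_class):
    """Max minus min exact-count accuracy across semantic classes."""
    values = list(acc_by_class.values())
    return max(values) - min(values)

# Illustrative per-class exact-count accuracies for one model (not paper data).
acc = {"cities": 0.91, "chemicals": 0.62, "phone_numbers": 0.55, "symbols": 0.83}
print(round(semantic_gap(acc), 2))  # 0.36
```

A gap near zero indicates the semantic invariance an ideal counting operator should exhibit; the benchmark's reported gaps of up to 0.40 indicate its absence.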
2. Specification of "What Counts" in Multimodal and Vision Models
The design of prompts and specification slots for "what counts" in object counting has evolved to support both open-world and negative definition capabilities. "CountGD++: Generalized Prompting for Open-World Counting" demonstrates a prompting interface comprising four orthogonal signal types: positive text (what to count), negative text (what not to count), positive visual exemplars, and negative visual exemplars. Each element can be present or absent; in operational terms, the model fuses these signals using cross-attention in a transformer backbone (SwinT), isolating object queries that best match the user-intended definition (Amini-Naieni et al., 29 Dec 2025).
Pseudo-exemplars extend the paradigm by automatically extracting visual exemplars from text-only input via top-N proposal selection, followed by a refinement pass, making manual annotation unnecessary. The model supports natural and synthetic external exemplars, further decoupling prompt specification from the input image. State-of-the-art accuracy is achieved across seven datasets (MAE drops from 16.55 to 8.39 on FSCD-147 with pseudo+synthetic exemplars), and negative exemplars/text reduce false positives by an order of magnitude. This architecture enables full user-driven flexibility over "what counts," including open vocabulary and "not-to-count" categories, without re-training for each definition (Amini-Naieni et al., 29 Dec 2025).
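The four-slot prompting interface can be illustrated with a toy sketch; the scoring logic below is a hypothetical stand-in for the model's cross-attention fusion, and all names are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CountPrompt:
    """Four orthogonal signal slots; any subset may be provided."""
    positive_text: Optional[str] = None
    negative_text: Optional[str] = None
    positive_exemplars: list = field(default_factory=list)
    negative_exemplars: list = field(default_factory=list)

def count_matches(candidates, prompt, sim, threshold=0.5):
    """Count candidates matching a positive signal more strongly than any negative."""
    positives = ([prompt.positive_text] if prompt.positive_text else []) + prompt.positive_exemplars
    negatives = ([prompt.negative_text] if prompt.negative_text else []) + prompt.negative_exemplars
    kept = 0
    for cand in candidates:
        pos = max((sim(cand, p) for p in positives), default=0.0)
        neg = max((sim(cand, n) for n in negatives), default=0.0)
        if pos >= threshold and pos > neg:
            kept += 1
    return kept

sim = lambda a, b: 1.0 if a == b else 0.0  # toy similarity function
prompt = CountPrompt(positive_text="car", negative_text="truck")
print(count_matches(["car", "car", "truck", "bus"], prompt, sim))  # 2
```

The point of the sketch is the contract, not the scoring: positive slots define what counts, negative slots subtract what does not, and every slot is optional.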
3. Algorithmic and Neural Approaches: Redundant Counting, Density Regression, and Exemplar Fusion
Counting system architectures operationalize "what counts" via various algorithmic primitives:
- In "Count-ception," the prediction target at each pixel is the count of objects within a fixed receptive field window. The model produces a redundant count map, with each true object covered by many overlapping windows. The global count is determined by summing all local predictions and correcting for redundancy, mathematically: where is the window size, is the stride, and is the per-window regressor output. This approach averages out local mis-counts, achieving superior MAE compared to density regression and traditional object detectors (Cohen et al., 2017).
- Few-shot, exemplar-based counting, as in "Count What You Want," lets the user verbalize the target class by providing audio tokens ("one," "two," "three") at the onset of initial repetitions within a sensor sequence. The model localizes these utterances via keyword spotting and dynamic programming, extracts multi-scale temporal exemplars, computes a similarity map (via cross-correlation and soft-DTW) between exemplars and the entire sensory sequence, then regresses a density profile per time window. Summing the density yields the total count. The exemplar specification delivers robust generalization to new action classes and unseen subjects without explicit re-training or pre-defined action categories, with MAE of 7.47—lower than deep frequency- or transformer-based approaches (Huang et al., 2023).
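The redundant-count correction described for Count-ception reduces to a sum-and-divide; this minimal sketch uses a 1-D count map for brevity and assumes the stride divides the window size:

```python
def global_count(count_map, window, stride):
    """Sum all per-window predictions and divide by the redundancy factor (r/s)^2."""
    redundancy = (window // stride) ** 2  # assumes stride divides window evenly
    return sum(count_map) / redundancy

# One object, window r=3, stride s=1: it appears in 3x3 = 9 overlapping
# windows, each predicting a local count of 1, so the raw sum over-counts 9x.
print(global_count([1] * 9, window=3, stride=1))  # 1.0
```

Because every window contributes an independent estimate, an error in any single window is diluted by the redundancy factor rather than propagating to the total.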
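The exemplar-matching pipeline described in "Count What You Want" can be caricatured on a 1-D stream; plain cosine similarity and peak-picking below are simplified stand-ins for the soft-DTW, multi-scale exemplars, and density regression the paper uses:

```python
def count_repetitions(sequence, exemplar, threshold=0.9):
    """Slide the exemplar over the stream and count above-threshold similarity peaks."""
    n, m = len(sequence), len(exemplar)
    sims = []
    for i in range(n - m + 1):
        window = sequence[i:i + m]
        dot = sum(a * b for a, b in zip(window, exemplar))
        norm = (sum(a * a for a in window) * sum(b * b for b in exemplar)) ** 0.5
        sims.append(dot / norm if norm else 0.0)
    # Treat each above-threshold local maximum as one repetition.
    count = 0
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else -1.0
        right = sims[i + 1] if i + 1 < len(sims) else -1.0
        if s >= threshold and s >= left and s >= right:
            count += 1
    return count

# Three repetitions of the exemplar pattern in a toy stream.
print(count_repetitions([0, 1, 0, 0, 1, 0, 0, 1, 0], [0, 1, 0]))  # 3
```

The key design idea survives the simplification: the user supplies "what counts" as a few exemplars, and the counter is just similarity against those exemplars, so new action classes need no retraining.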
4. Counting and Summarization in Structured Data and Sensor Streams
In domains such as genomics and device-free sensing, "what counts" is defined precisely in terms of feature overlap or event detection primitives:
- "featureCounts" summarizes short-read NGS data by counting the number of mapped reads/fragments overlapping each genomic feature (e.g., gene, exon). Technical solutions, including hierarchical chromosome hashing, fixed-length bins, and feature blocks, reduce per-read search to . Multiple user options allow "what counts" to be filtered by strand, mapping quality, multi-overlap handling (reads overlapping multiple features), or meta-feature assignment (gene-level, not just exon-level). Empirically, featureCounts achieves perfect agreement with other tools in uniform cases, 10–20x speedups, and superior memory efficiency, making feature-level quantification robust to varying user definitions of counted features (Liao et al., 2013).
- CrossCount, in the WiFi human counting context, operationalizes "what counts" as the number of temporal line-of-sight blockages detected over a window. Feature extraction binarizes RSS measurements into blockage events, which are used to synthesize multi-person training data via bitwise-OR of independent single-person trajectories; an LSTM classifier then produces people-count estimates. This explicit mapping aligns the counting function with the extracted temporal pattern, making the notion of "what counts" robust to physical environment and channel conditions (Ibrahim et al., 2020).
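The fixed-length binning that accelerates featureCounts-style read assignment can be sketched as follows; the bin width and interval conventions are illustrative assumptions, not the tool's actual constants:

```python
from collections import defaultdict

BIN = 128  # bin width in bases; an illustrative choice, not featureCounts' constant

def build_index(features):
    """features: list of (name, start, end) intervals, end inclusive.
    Each feature is registered in every fixed-length bin it spans."""
    index = defaultdict(list)
    for name, start, end in features:
        for b in range(start // BIN, end // BIN + 1):
            index[b].append((name, start, end))
    return index

def assign_read(index, read_start, read_end):
    """Return names of features overlapping the read, probing only its bins."""
    hits = set()
    for b in range(read_start // BIN, read_end // BIN + 1):
        for name, start, end in index[b]:
            if start <= read_end and read_start <= end:  # interval overlap test
                hits.add(name)
    return hits

idx = build_index([("geneA", 0, 300), ("geneB", 500, 900)])
print(assign_read(idx, 250, 350))   # {'geneA'}
```

A read probes only the handful of bins it overlaps, which is what turns per-read assignment from a scan over all annotated features into a near-constant amount of work.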
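The bitwise-OR synthesis of multi-person training data described for CrossCount is direct to sketch; the sequences and labels below are toy values:

```python
def synthesize(single_person_seqs):
    """OR together independent single-person binary blockage sequences,
    labeling the result with the number of people combined."""
    combined = [0] * len(single_person_seqs[0])
    for seq in single_person_seqs:
        combined = [a | b for a, b in zip(combined, seq)]
    return combined, len(single_person_seqs)  # (sequence, people-count label)

p1 = [1, 0, 0, 1, 0]  # one person's line-of-sight blockage events
p2 = [0, 1, 0, 1, 0]  # another person's, recorded independently
seq, label = synthesize([p1, p2])
print(seq, label)  # [1, 1, 0, 1, 0] 2
```

The OR models the physics of the detector: a timestep shows a blockage if any person blocks the line of sight, so labeled multi-person data can be manufactured from single-person recordings.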
5. Distributed Systems and Scalability: What Really "Counts" in Counters
Distributed counters must define "what counts" at a global level in the presence of replication, network unreliability, and the CAP theorem's limits. "Scalable Eventually Consistent Counters over Unreliable Networks" introduces the Handoff Counters construction, a CRDT that serializes partial local counts upward through a network of client/server tiers via the exchange of "slots" and "tokens". The key contract is that every issued increment is eventually accounted for exactly once at the durable tier-0 set, regardless of network partition/loss.
Theoretical guarantees are formalized via an invariant on the cumulative tier value, which bounds the durable count at every step and establishes correctness. Handoff Counters eliminate the identity-explosion problem of G-counters by limiting permanent state to the backbone. The "what counts" property here is both local (each node may fetch a safe lower bound) and global (all increments are eventually reflected), with monotonicity, eventual accounting, and anti-duplication proved. This approach generalizes to a wide class of commutative monoid CRDTs (Almeida et al., 2013).
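The exactly-once accounting contract can be illustrated with a deliberately simplified sketch; the real Handoff Counters protocol uses paired slot/token exchanges across client/server tiers, while the class below (all names invented) only demonstrates the idempotent-handoff idea that makes retransmission over a lossy network safe:

```python
class Tier0Server:
    """Durable tier that applies each handed-off partial count at most once."""
    def __init__(self):
        self.total = 0
        self.applied = set()  # slot ids already accounted for

    def handoff(self, slot_id, partial_count):
        if slot_id not in self.applied:   # idempotent: duplicates are ignored
            self.applied.add(slot_id)
            self.total += partial_count

server = Tier0Server()
server.handoff(("clientA", 1), 5)
server.handoff(("clientA", 1), 5)  # retransmission of the same slot
server.handoff(("clientB", 1), 3)
print(server.total)  # 8
```

Tagging each handoff with a unique slot id is what lets clients safely retry until acknowledged: loss cannot drop an increment permanently, and duplication cannot double-count it.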
6. Out-of-Distribution and Open-World Counting: Sensitivity of What Counts under Shift
COUNTS provides a rigorous evaluation of object detectors and multimodal LLMs when "what counts" is exposed to naturally occurring distribution shifts in real-world imagery. The O(OD)² benchmark tests OOD generalization in object detection: models trained on one set of domains (e.g., street, indoor) are evaluated on held-out domains (e.g., snow, sky). The OODG benchmark assesses multimodal LLM visual grounding under in-context example (ICE) shifts: covariate, label, or spurious associations.
Key findings include significant mAP drops (from ≈0.39 to ≈0.21) for detectors in OOD settings, and accuracy drops for MLLMs from ≈66% to as low as 28% under spurious ICE shifts. This exposes the instability of model output with respect to changes in what "counts" as an object under distributional perturbations, even when category definitions remain fixed (Li et al., 14 Apr 2025).
7. Broader Implications and Theoretical Considerations
The research collectively demonstrates that "what counts" is a foundational variable in nearly all algorithmic, neural, and distributed systems involving enumeration and summarization. Sensitivity of counting accuracy to semantic, distributional, or user-defined specification is substantial: approximate implementations and data-driven models instantiate meaning-dependent, unstable computational primitives across domains. These findings invalidate naive assumptions of semantic-invariance in counting and call for model architectures, prompting strategies, and evaluation protocols that foreground careful, explicit definition of "what counts" in each context (Ríos-García et al., 29 Jan 2026, Amini-Naieni et al., 29 Dec 2025, Li et al., 14 Apr 2025).