Instance-Level CIR Benchmark
- Instance-Level CIR Benchmark is a framework that evaluates retrieval systems by distinguishing specific object instances using combined image and text queries.
- It employs diverse data construction paradigms with hard negatives and strict instance definitions to enforce fine-grained compositional reasoning.
- It measures performance with metrics like mAP, Recall@K, and GAP, highlighting challenges in domain shift and one-shot learning conditions.
Instance-level CIR benchmarks are designed to evaluate retrieval systems that must distinguish between images of specific object instances—rather than semantic categories—using composed queries that combine a reference image and a textual modifier describing a desired transformation. This paradigm enforces fine-grained, object-centric reasoning and rules out solutions that rely solely on semantic or text-dominated similarity. Benchmarks of this type interrogate a system's capability to perform genuine composition, ground multi-modal signals, and disambiguate closely related instances in large-scale, long-tail, or high-noise settings.
1. Definition and Motivation
Instance-level composed image retrieval (CIR) benchmarks specify the retrieval task as follows: given a query image depicting a particular physical object and a modifier text describing a transformation (e.g., “at sunset”), the goal is to retrieve target images that show the same object instance (not just any member of its class) under the modification described by the text. Unlike semantic-level CIR, which tolerates retrieval of any on-class item, instance-level CIR penalizes both false substitutions (different object, right attribute) and semantic shortcuts (right class, wrong instance).
The motivation for instance-level CIR lies in the complexity of real-world retrieval—where object identity, not just class attributes, is essential—and in the need for robust benchmarks that force systems to capture compositionality and grounding rather than relying on degenerate image-only or text-only cues (Psomas et al., 29 Oct 2025).
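The relevance criterion above can be sketched as a simple predicate. This is a minimal illustration, not code from any of the cited benchmarks; the field names and helper inputs are hypothetical. The key point is the conjunction: a candidate must satisfy both the instance-identity condition and the textual modification.

```python
# Sketch of the instance-level CIR relevance criterion.
# All names here are illustrative, not from any benchmark's codebase.
from dataclasses import dataclass

@dataclass
class ComposedQuery:
    ref_image: str    # path/id of the reference image showing the instance
    modifier: str     # textual transformation, e.g. "at sunset"
    instance_id: str  # identity of the specific object instance depicted

def is_positive(query: ComposedQuery,
                candidate_instance_id: str,
                candidate_satisfies_modifier: bool) -> bool:
    """A candidate is relevant only if it shows the SAME instance
    AND exhibits the modification described by the text.
    Semantic-level CIR would drop the first condition."""
    return (candidate_instance_id == query.instance_id
            and candidate_satisfies_modifier)
```

Either condition failing alone corresponds to the two error modes the text names: a different instance with the right attribute, or the right instance without the modification.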
2. Benchmark Construction Paradigms
Recent instance-level CIR benchmarks employ diverse data construction strategies, each imposing strict controls to guarantee instance-level discrimination and high difficulty.
i-CIR Benchmark
The i-CIR dataset (Psomas et al., 29 Oct 2025) collects 202 object instances (each defined by several manually curated reference images), spanning domains from landmarks and products to characters. For each instance, multiple query images and text modifications are constructed; positives are exhaustively annotated to include only images showing the exact same object under the specified transformation. Hard negatives are harvested using a semi-automatic process: candidate images from LAION are filtered and then manually labeled as (a) visual hard negatives (same instance, wrong text), (b) textual hard negatives (right text, wrong instance), and (c) composed hard negatives (matching only one modality).
| Benchmark | Instances | Images | Queries | DB size/query |
|---|---|---|---|---|
| i-CIR | 202 | ~750,000 | 1,883 | 3,700 |
| EUFCC-CIR | n/a | 346,324 | 174,985 | n/a |
| The Met | 224,408 | 397,121 | ~20,000 | Full DB |
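The i-CIR hard-negative taxonomy described above can be summarized as a labeling function over two boolean tests per candidate. This is a simplified sketch (the manual "composed" category, which the paper defines separately, is collapsed here into the single-modality cases); the function name and labels are illustrative.

```python
# Simplified sketch of the i-CIR-style hard-negative categories.
# Labels are illustrative; the actual annotation is semi-automatic + manual.
from typing import Optional

def hard_negative_type(same_instance: bool,
                       matches_text: bool) -> Optional[str]:
    """Classify a candidate image relative to a composed query.
    Returns None for a true positive (both conditions hold)."""
    if same_instance and matches_text:
        return None        # true positive, not a negative
    if same_instance:
        return "visual"    # same instance, wrong text
    if matches_text:
        return "textual"   # right text, wrong instance
    return "easy"          # matches neither modality
```

A retrieval system that shortcuts on one modality will rank the "visual" or "textual" negatives above true positives, which is exactly what this construction is designed to expose.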
EUFCC-CIR and The Met
EUFCC-CIR (Net et al., 2024) leverages museum and GLAM collections, imposing instance definitions via Getty AAT hierarchical labels: triplets are sampled such that only a single material/object-type leaf changes across query-modifier-target triplets, while all other facets are held fixed. The Met dataset (Ypsilantis et al., 2022) directly equates each physical exhibit with an instance and enforces retrieval of visitor photos to studio-shot gallery images, introducing major domain shift and demanding precise instance discrimination, often in few-shot regimes.
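The single-changed-facet constraint behind EUFCC-CIR triplets can be sketched as a validity check over facet dictionaries. The facet keys and modifier template below are hypothetical placeholders, not the dataset's actual Getty AAT schema.

```python
# Sketch of a single-changed-facet triplet constraint (EUFCC-CIR-style).
# Facet names and the modifier template are illustrative assumptions.
def valid_triplet(query_facets: dict, target_facets: dict) -> bool:
    """Valid query/target pairs share the facet schema and differ
    in exactly one facet; all other facets stay fixed."""
    if query_facets.keys() != target_facets.keys():
        return False
    changed = [k for k in query_facets
               if query_facets[k] != target_facets[k]]
    return len(changed) == 1

def make_modifier(query_facets: dict, target_facets: dict) -> str:
    """Derive the textual modifier from the single changed facet."""
    (facet,) = [k for k in query_facets
                if query_facets[k] != target_facets[k]]
    return f"change {facet} to {target_facets[facet]}"
```

Fixing all but one facet is what keeps the task instance-level: the textual modifier carries exactly one controlled transformation, so a correct retrieval cannot be explained by class-level similarity alone.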
3. Evaluation Protocols and Metrics
Instance-level CIR benchmarks employ metrics sensitive to true composition and instance discrimination:
- mean Average Precision (mAP): Macro- or per-instance mAP is used to aggregate precision-recall over all queries and instances (Psomas et al., 29 Oct 2025, Ypsilantis et al., 2022).
- Recall@K: Measures the proportion of queries that retrieve at least one true positive within the top K ranked results; reported for various values of K (Net et al., 2024).
- Global Average Precision (GAP): Combines in-distribution and OOD queries into a single ranking, particularly for open-set or OOD detection protocols (Ypsilantis et al., 2022).
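The first two metrics can be computed directly from a ranked list of binary relevance labels. A minimal sketch (assuming, for AP, that every relevant item appears somewhere in the ranked list):

```python
# Minimal Recall@K and Average Precision over a single ranked list of
# 0/1 relevance labels. Assumes all relevant items appear in the list.
def recall_at_k(ranked_relevance: list, k: int) -> float:
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return float(any(ranked_relevance[:k]))

def average_precision(ranked_relevance: list) -> float:
    """Mean of precision values at each relevant rank position."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)
```

Macro-mAP as used for i-CIR then averages `average_precision` over queries within each instance, and finally over instances, so that heavily populated instances do not dominate the score.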
Distinct evaluation regimes are enforced:
- Per-instance sub-databases (i-CIR): Each instance is searched only within its own database, reflecting the challenge of distinguishing among visually similar confounders.
- Multi-target and OOD splits (EUFCC-CIR, The Met): Queries may have multiple valid targets; splits are constructed to assess robustness to both in-domain and distribution-shifted queries.
4. Methodological Approaches
Several methodological innovations have emerged to specifically address the demands of instance-level CIR:
- Multi-modal late fusion with strict logical AND operators (e.g., the multiplicative score fusion in FreeDom/BASIC) penalizes images that match only one modality, rewarding true composition (Psomas et al., 29 Oct 2025).
- Feature centering and semantic projections: Centering embedding spaces across LAION-scale corpora and projecting onto principal directions driven by object names improve fine-grained discrimination.
- Hard negative mining and manually verified negatives significantly enhance benchmark difficulty and preclude shortcut solutions (Psomas et al., 29 Oct 2025).
- Supervised and self-supervised contrastive learning: The Met (Ypsilantis et al., 2022) shows non-parametric kNN classifiers, trained with supervised contrastive losses on real and hardest negatives, outperform standard parametric classifiers—in part due to the long-tail one-shot nature of many art instances.
- Domain-specific zero-shot protocols: The EUFCC-CIR benchmark (Net et al., 2024) evaluates retrieval robustness across both well-represented (“inner test”) and under-represented (“outer test”) provider splits, and establishes that simple image+text averaging outperforms sophisticated zero-shot compositional methods.
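The contrast between AND-style late fusion and the simple averaging baseline can be illustrated in a few lines. This is a schematic sketch, not the FreeDom/BASIC implementation: it assumes per-candidate similarities already normalized to [0, 1], and the function name is hypothetical.

```python
# Schematic late fusion of per-candidate image and text similarities.
# Assumes similarities in [0, 1]; not the actual FreeDom/BASIC code.
import numpy as np

def fuse_scores(sim_img: np.ndarray, sim_txt: np.ndarray,
                mode: str = "product") -> np.ndarray:
    """'product' acts as a soft logical AND: a candidate scoring near
    zero on either modality is suppressed, whereas averaging lets a
    strong single-modality match dominate."""
    if mode == "product":
        return sim_img * sim_txt
    return 0.5 * (sim_img + sim_txt)  # simple averaging baseline
```

With a true composed match at (0.5, 0.5) and a visual-only match at (1.0, 0.2), the product ranks the composed match first (0.25 vs 0.20) while the average prefers the shortcut (0.5 vs 0.6), which is the failure mode instance-level benchmarks are built to detect.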
5. Empirical Findings and Baseline Performance
Empirical results on instance-level benchmarks consistently reveal that:
- Naive image- or text-only matching yields extremely low mAP or Recall@1, often below 3–5% (i-CIR, Table 1 (Psomas et al., 29 Oct 2025); EUFCC-CIR Table (Net et al., 2024)).
- Fusion approaches significantly outperform unimodal models, with composition-aware fusion (late multiplier, Harris penalty) achieving the highest performance.
- The FreeDom (BASIC) method achieves state-of-the-art macro-mAP (31.64%) on i-CIR, outpacing all prior semantic-level CIR pipelines by >3 mAP absolute (Psomas et al., 29 Oct 2025).
- The Met: supervised contrastive + non-parametric kNN achieves up to 36.1 GAP and 52.4 GAP on test (Ypsilantis et al., 2022); parametric classifiers substantially underperform.
- On EUFCC-CIR, Recall@1 remains below 5% for all zero-shot methods evaluated, highlighting dataset difficulty and strong domain shift effects (Net et al., 2024).
Key ablation studies demonstrate that every FreeDom component (feature centering, semantic projection, contextualization, and Harris fusion) contributes distinctly, with no single step dominating, and that dedicated domain corpora provide further gains (Psomas et al., 29 Oct 2025).
6. Significance, Limitations, and Outlook
Instance-level CIR benchmarks deliver a rigorous testbed for compositional, fine-grained, and open-set retrieval across a range of domains—landmarks, cultural heritage, consumer goods, and artworks.
Principal significance:
- They reward models able to genuinely perform compositional reasoning under severe negative mining and high inter-instance similarity (Psomas et al., 29 Oct 2025).
- They expose the weaknesses of text-dominated or semantic-matching solutions, forcing authentic integration of both visual and textual information.
- Instance-level protocols are essential for real-world applications—such as art cataloging, product search, or robust landmark verification—where object identity is non-negotiable.
However, limitations exist. Datasets such as i-CIR and The Met, while compact, are still not exhaustive in domain coverage or instance complexity; labeling remains labor-intensive and rarely covers fine-grained attribute variations beyond the constructed triplets. The fusion approaches (e.g., Harris penalty) may require adaptation for text-heavy or highly asymmetrical scenarios (Psomas et al., 29 Oct 2025).
Advancing this area will likely depend on: expanding labeled corpora, developing adaptive fusion mechanisms, integrating paraphrastic and attribute augmentation pipelines, and evolving open-set detection metrics that remain faithful to the compositional object-centric mandate.
7. Related Tasks and Future Research
Instance-level CIR shares key challenges with large-scale instance retrieval (e.g., GLDv2 (Weyand et al., 2020)) and domain-shift retrieval (e.g., The Met (Ypsilantis et al., 2022)), but is distinguished by its multi-modal composition requirement. Techniques from one-shot learning, long-tail recognition, and out-of-distribution detection are particularly relevant.
Promising future research directions include:
- Generating synthetic queries with richer, multi-attribute modifiers (Xing et al., 27 May 2025)
- Curriculum-driven instance complexity escalation
- Explicit inclusion of OOD queries and adversarial hard negatives
- Evaluation in challenging unconstrained settings (e.g., open world, streaming image feeds)
Instance-level CIR benchmarks thus set a demanding standard for the next generation of multi-modal retrieval architectures and compositional representation learning.