Identify Relationally Salient Details from a Single Image for Anonymous Captioning

Determine which visual details in a single image are irrelevant and which constitute the underlying relational pattern when generating an anonymous caption that abstracts the image’s logic.

Background

In generating anonymous captions that capture relational logic, the authors observe that writing such a caption from a single image is challenging: it is difficult to discern which elements are essential to the relational structure and which should be ignored.

They address this challenge by using groups of images that share the same underlying logic, which helps reveal the common relational structure, but they explicitly note that the ambiguity remains when only a single image is available.
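The intuition behind using groups can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes each image has already been reduced to a set of hypothetical descriptive tags (all tag names below are invented for illustration) and uses plain set intersection as a stand-in for whatever process extracts the shared logic. With a group, only the details common to every image survive; with a single image, every detail remains a candidate.

```python
from functools import reduce


def candidate_shared_details(images):
    """Return the tags present in every image of the group.

    `images` is a list of tag sets. The tags are hypothetical
    stand-ins for visual details, not output of any real model.
    """
    if not images:
        return set()
    return reduce(set.intersection, (set(tags) for tags in images))


# Hypothetical tag sets: surface details differ, relational ones repeat.
butterfly = {"butterfly", "wings", "sky_background",
             "ordered_sequence", "gradual_change_of_one_subject"}
candle = {"candle", "flame", "dark_background",
          "ordered_sequence", "gradual_change_of_one_subject"}
moon = {"moon", "night_sky",
        "ordered_sequence", "gradual_change_of_one_subject"}

# Single image: nothing can be ruled out, so all details stay candidates.
single = candidate_shared_details([butterfly])

# Group: only details shared by every image remain.
group = candidate_shared_details([butterfly, candle, moon])

print(sorted(single))
print(sorted(group))
```

In this toy setting the single-image case returns every tag unchanged, which mirrors the open question: without additional images, there is no signal separating surface details from the relational pattern.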

References

Writing a shared relational attribute from a single image is inherently challenging. For example, given only a sequence depicting a butterfly’s flight stages (Fig. \ref{fig:caption_by_group}, first row), it is unclear which visual details are irrelevant and which constitute the underlying relational pattern.

Relational Visual Similarity  (2512.07833 - Nguyen et al., 8 Dec 2025) in Relational Visual Similarity — Creating a Relational Dataset — Generating anonymous captions