Relational Visual Similarity

Published 8 Dec 2025 in cs.CV, cs.AI, and cs.LG | (2512.07833v1)

Abstract: Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-LLM to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel framework that captures relational visual similarity through group-based captioning and contrastive learning.
It leverages a vision-language model to disentangle relational logic from attribute biases, improving retrieval and analogical image generation.
Empirical results show that the relsim model outperforms baselines like LPIPS, CLIP, and DINO, aligning more closely with human judgment.

Relational Visual Similarity: A Framework for Human-Like Analogical Perception

Introduction

Visual similarity metrics play a central role in modern computer vision, underpinning tasks in retrieval, recognition, and generative modeling. Canonical approaches, including LPIPS, CLIP, and DINO, conceptualize similarity primarily in terms of perceptual or attribute features, such as color, texture, and class semantics. However, human perception also robustly supports relational similarity: abstraction over analogical structure, where the mapping between entities is defined by the correspondence of relationships rather than raw attribute overlap. "Relational Visual Similarity" (2512.07833) systematically formulates, benchmarks, and learns this missing axis of similarity, proposing new data, models, and evaluation protocols for capturing logic- and abstraction-driven visual correspondence. This essay analyzes the technical design, empirical results, and implications of the work.

Figure 1: Overview of the pipeline for constructing a relational image similarity model leveraging image filtering, anonymous group-based captioning, and contrastive vision-LLM training.

Problem Formulation and Dataset Construction

While attribute similarity is easily framed via low- (pixels), mid- (features), or high-level (concepts/classes) signal matching, relational similarity requires comparing the relations among visual elements. The authors identify that existing datasets are fundamentally limited in this regard: BAPPS and ImageNet, for example, only support attribute correspondence.

To address this, the pipeline begins with large-scale filtering of LAION-2B. A VLM (Qwen2.5-VL-7B) is fine-tuned on human-labeled data to select “relationally interesting” images—those with compositional or analogical structures. This process surfaces roughly 114k images, which are then further curated.

Figure 2: Examples contrasting images deemed relationally interesting (complex internal structure) versus ordinary (surface-level attributes).

Crucially, the relational structure is not directly annotated at the image level but is instead revealed via anonymous group-based captioning. Manually curated groups of images (532 groups, 2-10 images/group) are used to elicit the shared underlying logic (i.e., “growth process of {Subject} described in 4 stages”). VLMs generate and humans verify such group-based captions that avoid object-specific language, acting as abstraction anchors for the images.

Figure 3: Demonstration that group context is necessary to reliably elicit anonymous relational captions that avoid overfitting to surface content.

Relational Visual Similarity Model

Given image–caption pairs $\{(I_i, A_i)\}$ , the relational visual similarity (relsim) model is realized as a vision-language encoder (Qwen2.5-VL-7B for images, MiniLM for text), operating under an InfoNCE-style contrastive learning objective. Images are mapped close in feature space if and only if their anonymous (relational) captions are similar. The process discards surface-level alignment, rewarding logic-based correspondence.

This marks a methodological pivot: instead of leveraging vision-only backbones, the approach insists that relational reasoning is fundamentally dependent on language-based abstraction and world knowledge. By freezing text encoders and instructing the image encoder with logic-focused prompts, the model is explicitly disentangled from attribute similarity bias.

Empirical Analysis

Evaluation Protocol

Evaluation is performed in the context of image retrieval: given a query, retrieve its most relationally similar instance from a mixed real-world database. Baselines include LPIPS, CLIP (image and text), DINO, and caption-driven retrieval via CLIP-T and Qwen-T; these span the spectrum from low-level to semantic similarity and include both direct and language-proxy approaches. Notably, caption baselines operate by independently describing images with Qwen-generated anonymous captions, highlighting the necessity of group-driven abstraction.

Automated judgments are performed via GPT-4o, which is instructed to focus exclusively on analogical/relational logic, ignoring raw perceptual and semantic similarity.

Figure 4: Quantitative comparison of retrieval metrics — relsim outperforms all established baselines at relational similarity matching.

The main findings:

Standard metrics fail: LPIPS (4.56) and DINO (5.14) underperform on relational similarity, validating that self-supervised and pixel/feature-level models lack analogical abstraction.
Vision-language superiority: CLIP-I (5.91) is stronger due to richer semantics, but remains attribute-dominated. Only the relsim VLM (6.77) meaningfully encodes relational logic.
Group-based captioning is critical: Captioning from a single image (CLIP-T, Qwen-T) is insufficient, inadvertently leaking attribute content.
Vision-only finetuning is not enough: Even finetuned DINO/CLIP on the relational dataset cannot match VLM-based relsim, emphasizing the unique role of language.

Human Alignment

A/B user studies comparing relsim with each baseline confirm that humans robustly prefer the relsim-driven retrievals (42.5–60.7% preference rates), and that relational similarity is both perceived and operationalized distinct from attribute-based judgments.

Figure 5: AB test user study: relsim achieves significantly higher preference rates than all attribute-based or one-image captioning baselines.

Relational and Attribute Similarity as Complementary Signals

Visualization of the similarity space demonstrates that relational similarity and attribute similarity span orthogonal axes. Their combination yields the richest assessments, enabling discovery of logical matches regardless of attribute coincidence.

Figure 6: Two-dimensional visualization shows that relational and attribute similarity partition the space in qualitatively distinct ways.

Qualitative Retrieval and Generation

By constructing a similarity search space, the authors illustrate that relsim can retrieve analogically related images that are maximally dissimilar at the pixel or class level.

Figure 7: Given a query, relsim retrieves images matching its underlying logic rather than surface appearance — enabling logic-driven search.

For generative tasks, relsim provides a direct handle for analogical image generation: unlike normal editing, which enforces attribute transfers, relsim can guide VLMs or diffusion models to map or generate images that share conceptual structure with a provided prompt.

Figure 8: “Analogical image generation” — relsim enables generation that preserves logic/structure rather than superficial attributes.

Quantitative benchmarks evidence a performance gap: proprietary models (GPT-4o-image, Gemini 2.5) better support analogical generation over open-source VLMs, but the upper bound is defined by relsim rather than CLIP or LPIPS.

Figure 9: Proprietary models show improved ability for complex relational transformations, yet open-source models lack in analogical generation.

Theoretical and Practical Implications

Theoretical Implications: By formalizing relational visual similarity, the paper exposes an axis of abstraction critical for human-like perception and analogy. This challenges the attribute-dominated paradigm and suggests that next-generation models must fundamentally encode relations and logic, ideally through language–vision synergy.

Practical Implications: Applications include retrieval (finding images matching not by class but by conceptual logic), analogical editing and idea generation, creative image composition, and richer evaluation benchmarks for generative models. Notably, the relsim metric can diagnose and drive improvements in open-source VLMs, which currently lag in analogical reasoning.

Limitations and Future Directions

The main constraint of the current approach is scalability: anonymous captioning depends on manually curated groups and VLM-derived captions, which may encode biases or omit relevant but subtle logics. Additionally, each image can support multiple relational structures; handling user intent and structure selection is yet unsolved. Automating the expansion and annotation of relational logics is necessary for broader adoption.

Conclusion

By proposing relsim and a scalable pipeline for capturing relational visual similarity, this work fundamentally extends the operational definition of image similarity in computer vision. The results establish robust empirical evidence that attribute and relational similarity are distinct and complementary, demand vision-language modeling, and directly support human-aligned analogical reasoning. Future developments likely include large-scale relational benchmarking, improved analogical generation, and unified joint modeling of attribute and relational similarity dimensions.

Markdown