Specify Intended Relational Structure via Text Prompts

Determine how text prompts can specify which relational structure a user intends when an image embodies several distinct relational structures, so that relational visual similarity systems such as the relsim model can unambiguously select the desired relational mapping for retrieval or generation tasks.

Background

The paper introduces relational visual similarity and the relsim model, which aligns image embeddings with anonymous captions that encode underlying relational logic rather than surface attributes. The authors note that a single image may express several relational structures at once, so more than one relational mapping can be valid.

This ambiguity complicates downstream applications such as relational image retrieval and analogical image generation, where a user’s intended relational logic needs to be clearly identified. The authors explicitly state that determining how text prompts can be used to disambiguate and select the intended relational structure remains an open question.
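
The ambiguity can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not the authors' method and not the relsim API: the 3-dimensional vectors, the two relational-structure names, and the `blend` rule that mixes a prompt embedding into the query are all hypothetical. It shows a query image that scores equally against two relational structures, and how some form of prompt conditioning could in principle break the tie.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def blend(image_vec, prompt_vec, alpha=0.5):
    """Hypothetical conditioning rule: linear mix of image and prompt embeddings."""
    return [(1 - alpha) * i + alpha * p for i, p in zip(image_vec, prompt_vec)]

# Hypothetical toy embeddings (assumed values, not produced by any real model).
query_image = [1.0, 1.0, 0.0]  # an image that embodies two relational structures
candidates = {
    "transformation_over_time": [1.0, 0.0, 0.0],
    "part_whole_arrangement": [0.0, 1.0, 0.0],
}
# Hypothetical embedding of a text prompt that stresses the first structure.
prompt = [1.0, 0.0, 0.0]

# Without a prompt, both relational structures score the same: the ambiguity.
ambiguous_scores = {
    name: cosine(query_image, vec) for name, vec in candidates.items()
}

# With the (assumed) prompt conditioning, one structure is preferred.
conditioned_query = blend(query_image, prompt)
conditioned_scores = {
    name: cosine(conditioned_query, vec) for name, vec in candidates.items()
}
best = max(conditioned_scores, key=conditioned_scores.get)

print("without prompt:", ambiguous_scores)
print("with prompt:   ", conditioned_scores)
print("selected:", best)
```

Whether a simple mix of this kind would work with real relational embeddings is exactly what the open question leaves unresolved; the sketch only illustrates the tie and one conceivable way of breaking it.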

References

Last but not least, we acknowledge that one image can embody multiple different relational structures, potentially leading to multiple valid relational mappings. Determining how to use text prompts to specify which relational structure a user intends remains an open question.

Relational Visual Similarity (2512.07833 - Nguyen et al., 8 Dec 2025) in Conclusion and Discussion