Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Published 24 Feb 2026 in cs.CV | (2602.21175v1)

Abstract: Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative LLM as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility: it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency: enriched queries are explicitly interpretable by users; and 3) controllability: retrieval results can be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.

Summary

  • The paper proposes Quality-Controllable Query Completion (QCQC) to enrich underspecified queries with explicit quality parameters, enhancing semantic and aesthetic retrieval.
  • It leverages a fine-tuned generative LLM to generate quality-aware, contextually detailed prompts that are compatible with existing pretrained vision-language models.
  • Experimental results on MS-COCO and Flickr2.4M demonstrate robust improvements in both relevance and aesthetics over state-of-the-art baseline methods.

Quality-Controllable Visual Retrieval via LLM-Driven Query Enrichment

Problem Motivation and Formulation

The paper addresses a critical limitation in current text-to-image retrieval (T2IR) pipelines: their susceptibility to semantically ambiguous, underspecified short queries, which are prevalent in real-world applications. State-of-the-art vision-language models (VLMs), while offering robust semantic alignment, exhibit degraded performance on such terse queries. This is primarily attributed to (1) semantic ambiguity, which yields a diffuse, poorly discriminated search subspace; (2) semantic collision in similarity-based ranking; and (3) the inability to modulate retrieval quality (e.g., aesthetics, interestingness) in response to user needs, a deficit that precludes explicit quality control and is especially relevant for personalized and content-driven retrieval scenarios.

To resolve these issues, the work introduces the paradigm of quality-controllable retrieval (QCR). Rather than merely seeking semantic congruence, QCR explicitly incorporates user-controllable quality parameters—specifically, relevance (semantic consistency) and aesthetics (visual appeal)—as conditioning variables. The goal is to steer retrieval such that, for a given short query, the returned images also match a user-articulated or system-conditioned quality regime.
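Concretely, each gallery image's relevance and aesthetic scores can be binned into discrete, user-facing levels that then act as conditioning variables. A minimal sketch, where the scores, bin edges, and level names are illustrative assumptions (the paper derives its levels from relevance and aesthetic scoring models):

```python
import numpy as np

# Illustrative quality scores for a small gallery (assumed values).
relevance = np.array([0.18, 0.34, 0.52, 0.71])   # e.g. cosine similarity to the query
aesthetics = np.array([3.1, 4.8, 6.2, 8.5])      # e.g. a 1-10 aesthetic predictor

def to_levels(scores, edges, names=("Low", "Medium", "High")):
    """Discretize continuous scores into human-interpretable levels."""
    return [names[i] for i in np.digitize(scores, edges)]

# Assumed bin edges; in practice they would follow the scoring models' statistics.
rel_levels = to_levels(relevance, edges=[0.3, 0.6])
aes_levels = to_levels(aesthetics, edges=[4.0, 7.0])

print(rel_levels)  # ['Low', 'Medium', 'Medium', 'High']
print(aes_levels)  # ['Low', 'Medium', 'Medium', 'High']
```

The resulting level pairs, rather than raw scores, are what the user sees and what conditions the completion model, which is what makes the control interface interpretable.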

Methodology: Quality-Conditioned Query Completion (QCQC)

The central technical contribution is the Quality-Conditioned Query Completion (QCQC) module, which operationalizes QCR in a manner compatible with any pretrained VLM without the need for architectural modification. The method can be summarized as follows:

  1. Query Completion via LLM: A generative LLM is leveraged as a query completion engine. Given a short query and discretized quality conditions, the LLM is fine-tuned to generate semantically richer, quality-aware textual prompts that encode both class-level attributes and quality parameters (e.g., pose, scene details, aesthetic descriptors).
  2. Quality Conditioning: Auxiliary annotation of the image gallery is conducted, where each image is assigned a relevance score (cosine similarity in a VLM joint embedding space) and an aesthetic score (output of a pretrained aesthetic evaluation model). Both scores are discretized into human-interpretable levels (e.g., Low, Medium, High), which are used as explicit conditioning variables during both LLM training and inference.
  3. Data Generation and Model Training: For each image, a textual caption (from a captioning model), relevance score, and aesthetic score are collated. An instruction-concatenation format is adopted for LLM fine-tuning, ensuring that generated query completions are quality-aware and context-relevant. The training loss is standard autoregressive next-token prediction, with explicit conditioning on discretized quality levels.
  4. Retrieval Pipeline: At retrieval time, the system generates quality-conditioned query completions for the underspecified query, encoding the user’s quality preference in the prompt. The VLM is then queried with these enriched prompts to select images whose representation best aligns with the generated embedding in the joint space.
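The steps above can be sketched end to end. In this minimal sketch, `complete_query` is a hypothetical stand-in for the fine-tuned LLM, the toy embeddings stand in for a frozen VLM's encoders, and the specific completion text is invented for illustration:

```python
import numpy as np

def complete_query(short_query, relevance, aesthetics):
    """Stand-in for the fine-tuned LLM: a real system would generate a
    completion steered by the discretized quality conditions."""
    return (f"{short_query}: a golden retriever running on a sunlit beach, "
            "sharp focus, vivid colors, professional photo")

def retrieve(text_embed, gallery_embeds, k=2):
    """Rank gallery images by cosine similarity in the VLM joint space."""
    t = text_embed / np.linalg.norm(text_embed)
    g = gallery_embeds / np.linalg.norm(gallery_embeds, axis=1, keepdims=True)
    scores = g @ t
    return np.argsort(-scores)[:k]

# Toy joint-space embeddings standing in for frozen VLM encoder outputs.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 8))

enriched = complete_query("dog", relevance="High", aesthetics="High")
# Pretend the text encoder maps the enriched prompt near gallery image 3:
text_embed = gallery[3] + 0.01 * rng.normal(size=8)

top = retrieve(text_embed, gallery)
print(top)  # image 3 should rank first
```

Because only the query text changes, the same frozen VLM encoders serve every quality condition, which is the source of the framework's plug-in compatibility.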

Theoretical Analysis

The authors provide a formal analysis of how query completion can increase the discriminative power of the retrieval system. Modeling query enrichment as a structured perturbation in the text embedding space, they show that, under mild assumptions (non-degenerate singular values and independence of the new directions introduced by completion), completion can increase the rank of the query-image score matrix. This higher-rank structure enables finer-grained distinctions among images, improving both selectivity and controllability, a property unavailable with the original underspecified queries.
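The rank argument can be illustrated numerically. In the toy setup below (dimensions and embeddings are assumptions, not the paper's construction), two ambiguous queries collide into the same text embedding, and completion is modeled as a perturbation that adds an independent direction per query:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_images = 16, 10
images = rng.normal(size=(n_images, d))

# Two ambiguous short queries that collide: identical text embeddings.
q = rng.normal(size=d)
queries_short = np.stack([q, q])

# Completion modeled as a structured perturbation adding an
# independent direction to each query embedding.
queries_completed = queries_short + rng.normal(size=(2, d))

S_short = queries_short @ images.T        # 2 x 10 query-image score matrix
S_completed = queries_completed @ images.T

print(np.linalg.matrix_rank(S_short))      # 1: collided queries rank all images identically
print(np.linalg.matrix_rank(S_completed))  # 2: completions separate the two queries
```

With rank 1, both queries induce the same ranking over the gallery; the extra rank after completion is exactly what lets the system return different, better-targeted results for each enriched query.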

Experimental Validation

Comprehensive experiments are conducted on two datasets: MS-COCO (with human-annotated captions) and Flickr2.4M (CC0-licensed images with generated captions). The evaluation focuses on both quantitative metrics (average relevance and aesthetics of retrieved images) and qualitative case studies.

Key empirical findings:

  • The QCQC method consistently outperforms strong baselines (prefix-only completion, generic LLM completion, fine-tuning with random quality assignment, and off-the-shelf LLMs such as GPT-4o and LLaMA-3) in both relevance and aesthetics across quality conditions.
  • Unlike post-retrieval filtering approaches, which only enable late-stage re-ranking by quality at the cost of semantic drift, QCQC achieves native, end-to-end quality control by generating dataset- and condition-aware queries.
  • The approach is robust to caption source and quality, to the choice of VLM backbone, and to the granularity of quality discretization (effective with both three-level and five-level schemes).
  • Qualitative results demonstrate the system's ability to generate diverse, interpretable query completions that accurately reflect the target quality condition and the dataset context.
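The contrast with post-retrieval filtering can be made concrete with a toy example: re-ranking a semantically retrieved shortlist purely by aesthetic score discards relevance, which is the semantic drift noted above. The scores below and the additive trade-off are illustrative assumptions only; QCQC realizes the joint preference through the enriched query itself, not through an explicit score combination:

```python
import numpy as np

# Assumed (relevance, aesthetics) scores for a top-5 semantic shortlist.
relevance = np.array([0.95, 0.90, 0.40, 0.35, 0.30])
aesthetics = np.array([0.60, 0.55, 0.95, 0.90, 0.85])

# Post-retrieval filtering: late re-ranking of the shortlist by aesthetics
# alone promotes weakly relevant images (semantic drift).
post_hoc = np.argsort(-aesthetics)
print(post_hoc[:2])  # picks images 2 and 3, whose relevance is <= 0.40

# Query-side conditioning (QCQC-style, simplified here as an additive
# trade-off): quality enters the ranking jointly with semantics.
joint = np.argsort(-(relevance + aesthetics))
print(joint[:2])     # keeps the highly relevant images 0 and 1
```

The point of the sketch is the failure mode, not the scoring rule: any purely late-stage filter can only reorder what semantics already retrieved, whereas a quality-aware query changes what is retrieved in the first place.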

Notably, the method’s adaptability is maintained regardless of whether the base dataset has human or generated captions, provided that the LLM is properly fine-tuned on the dataset-contextualized triplets (query, relevance, aesthetics). However, cross-dataset generalization is limited due to dataset-specific semantics, highlighting the importance of context-aware finetuning.

Implications and Future Work

This method aligns with a broader objective in multimodal retrieval systems: enabling user-driven, fine-grained controllability without necessitating heavy retraining or altering pretrained VLM architectures. Practically, QCQC provides transparent and interpretable interfaces for quality modulation, suitable for diverse application domains (creative search, e-commerce, education). The approach can generalize to other notions of quality (e.g., interestingness, diversity), contingent on the availability of scoring models.

Theoretically, the rank-enhancing property of the approach suggests connections to model expressivity and information bottlenecks in multimodal retrieval. Integrating more advanced or multimodal-aware scoring models, extending to user-specific personalization, and adaptive prompt engineering are promising future directions.

Conclusion

The paper introduces a novel solution to the challenge of under-specified queries in text-to-image retrieval via quality-conditioned query completion. By finetuning an LLM to produce quality-controlled descriptions and leveraging existing VLMs as frozen backbones, the proposed framework achieves transparent, interpretable, and fine-grained control of both semantic and quality aspects in large-scale retrieval. The method demonstrates significant empirical gains in relevance and aesthetics, robust cross-dataset applicability (subject to finetuning), and offers a modular foundation for future research in controllable, user-driven multimodal search systems.
