
Query-Guided Captions

Updated 21 January 2026
  • Query-guided captioning is a technique that generates text descriptions conditioned on specific queries to focus on critical multimodal evidence.
  • It employs query-centric attention mechanisms and gradient-based selection to refine visual question answering, video retrieval, and scene-text captioning.
  • Recent methods use multi-stage reasoning and chain-of-captions protocols, achieving accuracy gains of up to 12% on benchmarks.

Query-guided captions are textual descriptions generated or selected to be maximally informative with respect to a specific query or question—rather than providing generic, question-agnostic descriptions. This paradigm is central to modern multimodal reasoning systems, notably visual question answering (VQA), data visualization, scene-text image captioning, video retrieval, and multi-image or multi-modal LLM-based reasoning. Recent architectures achieve state-of-the-art performance by conditioning caption generation and selection on the query itself, either via targeted attention, explicit prompt decomposition, adaptive filtering, or gradient-based alignment. This article synthesizes the dominant methodologies and empirical findings from key research such as "Generating Question Relevant Captions to Aid Visual Question Answering" (Wu et al., 2019), "Narrating the Video" (Hur et al., 7 Mar 2025), "QG-CoC: Question-Guided Chain-of-Captions" (Kao et al., 5 Nov 2025), and related works.

1. Formal Definition and Task Scope

Query-guided captioning generalizes conventional captioning by conditioning the output on both the multimodal input (image, audio, data, video, code) and a user-specified query or task. The output is a caption $C$ that maximizes the conditional probability $P(C \mid X, Q)$, where $X$ is the content (e.g. image(s), video) and $Q$ is the query.

  • In VQA: captions are generated (or selected) to support answering a visual question about an image.
  • In data visualization: captions are produced to highlight analysis aspects relevant to a user-specified instruction or question (Liew et al., 2022).
  • In multi-image LLMs: the QG-CoC protocol decomposes complex queries into sub-questions and generates targeted captions for each visual input (Kao et al., 5 Nov 2025).
  • In scene-text tasks: question-controlled captioning explicitly fuses questions with OCR and object features to provide personalized descriptions (Hu et al., 2021).

A generic formulation is:

$C^* = \arg\max_{C} P(C \mid X, Q)$

where $C^*$ should focus exclusively on evidence and relationships required to resolve $Q$.
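As a concrete illustration, the argmax selection over a pool of candidate captions can be sketched with a toy relevance scorer standing in for $P(C \mid X, Q)$. The word-overlap heuristic below is an assumption for illustration only, not any paper's actual scoring function:

```python
from collections import Counter

def relevance_score(caption: str, query: str) -> float:
    """Toy stand-in for log P(C | X, Q): length-normalized word overlap."""
    cap_words = Counter(caption.lower().split())
    query_words = Counter(query.lower().split())
    overlap = sum((cap_words & query_words).values())
    return overlap / max(len(caption.split()), 1)

def select_caption(candidates: list[str], query: str) -> str:
    """argmax_C score(C | Q) over a candidate caption pool."""
    return max(candidates, key=lambda c: relevance_score(c, query))

candidates = [
    "A city street at dusk.",
    "Two red buses parked near a station.",
    "People walking under umbrellas.",
]
print(select_caption(candidates, "how many red buses are parked"))
# → "Two red buses parked near a station."
```

A real system would replace `relevance_score` with a learned conditional model or embedding similarity, but the selection structure is the same.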

2. Joint Architectures and Attention Mechanisms

Architectures for query-guided captioning invariably integrate cross-modal encodings and query-centric attention. The approach in (Wu et al., 2019) exemplifies this with (i) a bottom-up image encoder (Faster R-CNN + ResNet-101), (ii) a GRU-encoded question vector $q$, and (iii) top-down attention $A^{qv}$ generating question-conditioned region features $V^q$. Captioning is performed by a GRU with word-wise attention over $V^q$, supervised only on the most query-relevant caption per (image, question) pair.

The VQA module further embeds generated/human captions using a dual-GRU attention over words, modulating the image features via a second attention mechanism $A^{cv}$:

$a^{cv}_k = f(f(c) \circ f(v^q_k)), \qquad \overline{v}^{qc} = \sum_k \alpha^{cv}_k v^q_k$

where $c$ is the caption embedding and $\alpha^{cv}_k$ are softmax-normalized weights. The joint representation for answer prediction is:

$h = q \circ (f(\overline{v}^{qc}) + f(c))$

which enforces joint reasoning over question, caption, and conditioned image regions.
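The attention equations above can be traced numerically. The NumPy sketch below uses random weights; the shared projection `f` and the reduction of $a^{cv}_k$ to a scalar logit per region are simplifying assumptions (the original model uses learned gated layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 5  # feature dimension, number of attended regions

def f(x, W):
    # stand-in for the paper's learned nonlinear projection f(.)
    return np.tanh(x @ W)

W_c, W_v, W_a, W_h = (rng.standard_normal((d, d)) for _ in range(4))

c = rng.standard_normal(d)         # caption embedding
q = rng.standard_normal(d)         # question embedding
V_q = rng.standard_normal((K, d))  # question-conditioned region features v^q_k

# a^{cv}_k = f(f(c) o f(v^q_k)), reduced here to one logit per region
logits = f(f(c, W_c) * f(V_q, W_v), W_a).sum(axis=1)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()               # softmax-normalized weights alpha^{cv}_k

v_qc = alpha @ V_q                 # \bar v^{qc} = sum_k alpha_k v^q_k
h = q * (f(v_qc, W_h) + f(c, W_h)) # joint representation for answer prediction
print(h.shape)
```

The element-wise product with $q$ in the last line is what forces the answer head to reason jointly over question, caption, and caption-attended image regions.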

In video retrieval, NarVid (Hur et al., 7 Mar 2025) applies frame-level captioning via LLMs, then performs cross-modal co-attention between frame features $V$ and their captions $N$, followed by query-aware adaptive filtering that prunes irrelevant frames based on cosine similarity to the query.
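A minimal sketch of such query-aware frame filtering, assuming a fixed keep ratio (NarVid's actual adaptive criterion may differ):

```python
import numpy as np

def filter_frames(frame_feats, query_feat, keep_ratio=0.5):
    # normalize, score frames by cosine similarity to the query,
    # and keep the top fraction of frames
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sims = f @ q
    k = max(1, int(len(sims) * keep_ratio))
    keep = np.sort(np.argsort(sims)[::-1][:k])  # retained frame indices
    return keep, sims

rng = np.random.default_rng(1)
frames = rng.standard_normal((8, 16))  # 8 frame feature vectors
query = rng.standard_normal(16)        # query embedding
kept, sims = filter_frames(frames, query, keep_ratio=0.25)
print(kept)  # the two most query-similar frame indices
```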

3. Caption Selection and Alignment Strategies

A critical aspect is selecting or generating captions actually beneficial to the task—whether VQA accuracy, retrieval, or multi-step reasoning. (Wu et al., 2019)'s gradient-based online selection is archetypal: for $C$ human captions per image, it computes the dot product between gradients of the answer-prediction logit ($G^{vqa}_k$) and of the caption-generation log-probability ($G^{c_i}_k$) with respect to shared image features. The optimal caption $i^*$ satisfies:

$i^* = \arg\max_i \sum_{k=1}^K (G^{vqa}_k)^\top (G^{c_i}_k), \qquad \sum_k (G^{vqa}_k)^\top (G^{c_i}_k) > \xi$

where $\xi > 0$ is a threshold, and the caption loss is omitted if no caption meets the constraint.
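The selection rule follows directly from the formula. The toy gradients below are hand-constructed for illustration, not taken from any trained model:

```python
import numpy as np

def select_caption_by_gradient(G_vqa, G_caps, xi=0.0):
    """Gradient-alignment caption selection (sketch).
    G_vqa:  (K, d) gradients of the answer logit w.r.t. K shared image features
    G_caps: (C, K, d) gradients of each caption's log-probability
    Returns (i*, scores), or (None, scores) if no caption exceeds xi
    (in which case the caption loss is skipped for this example)."""
    scores = np.einsum('kd,ckd->c', G_vqa, G_caps)  # sum_k <G^vqa_k, G^{c_i}_k>
    i_star = int(np.argmax(scores))
    if scores[i_star] > xi:
        return i_star, scores
    return None, scores

# toy setup: caption 1 is aligned with the VQA gradient, caption 0 opposes it
G_vqa = np.ones((2, 3))                              # K=2 features, d=3
G_caps = np.stack([-G_vqa, 2 * G_vqa, 0.1 * G_vqa])  # C=3 candidate captions
i_star, scores = select_caption_by_gradient(G_vqa, G_caps, xi=0.0)
print(i_star)  # → 1
```

Intuitively, the dot product rewards captions whose training signal pushes the shared image features in the same direction as the answer-prediction signal.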

Alternative selection mechanisms include query-aware nucleus sampling in video (NarVid)—sorting frame/caption pairs by query similarity and retaining only the nucleus for further matching.
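A minimal sketch of nucleus-style retention by query similarity; the softmax-over-similarities formulation here is an assumption for illustration, and NarVid's exact scoring may differ:

```python
import math

def query_nucleus(sims, p=0.8):
    # sort candidates by query similarity, softmax the scores, and keep
    # the smallest prefix whose cumulative probability mass reaches p
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    exps = [math.exp(sims[i]) for i in order]
    total = sum(exps)
    kept, mass = [], 0.0
    for idx, e in zip(order, exps):
        kept.append(idx)
        mass += e / total
        if mass >= p:
            break
    return kept

print(query_nucleus([0.1, 2.0, 0.3, 1.5, -0.5], p=0.8))  # → [1, 3, 2]
```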

4. Prompt-based and Chain-of-Captions Methods

Zero-shot prompting methods for query-guided captioning, notably QG-CoC (Kao et al., 5 Nov 2025), structure the reasoning pipeline as:

  1. Decompose the query $Q$ into sub-questions $\{q_1, \dots, q_K\}$.
  2. For each image $I_i$ and sub-question $q_j$, generate a focused caption $C_{i,j}$ answering $q_j$ for $I_i$.
  3. Aggregate these captions as evidence to answer each sub-question.
  4. Integrate the sub-answers as the final response.

The method leverages prompt templates:

Step 1) Decompose the question into q1, ..., qK.
Step 2) For each qj, write a caption for each image focusing on visual details that help answer qj.
Step 3) Answer each sub-question using the captions.
Step 4) Integrate to answer the original question.

This yields gains of up to +12% accuracy on multi-image reasoning benchmarks (MUIR, MuirBench) for open-source and closed-source MLLMs.
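The four-step protocol can be expressed as a pipeline skeleton. Here `mllm` is a hypothetical callable standing in for the multimodal LLM, and `stub_mllm` merely echoes prompts for demonstration; neither reflects any paper's actual API:

```python
def qg_coc(question, images, mllm):
    # `mllm` is a hypothetical callable: (prompt, image=None) -> str
    # Step 1) decompose the question into sub-questions
    subs = mllm(f"Decompose into sub-questions: {question}").split("\n")
    # Step 2) one focused caption per (image, sub-question) pair
    caps = {(i, j): mllm(f"Caption the image, focusing on: {sq}", image=img)
            for i, img in enumerate(images) for j, sq in enumerate(subs)}
    # Step 3) answer each sub-question from its captions
    sub_answers = [mllm(f"Answer '{sq}' given: "
                        + " ".join(caps[(i, j)] for i in range(len(images))))
                   for j, sq in enumerate(subs)]
    # Step 4) integrate the sub-answers into the final response
    return mllm(f"Integrate {sub_answers} to answer: {question}")

def stub_mllm(prompt, image=None):
    # echo stub standing in for a real multimodal LLM
    return f"[{prompt[:30]}...]"

answer = qg_coc("Which image shows more cats?", ["imgA", "imgB"], stub_mllm)
print(answer)
```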

5. Empirical Performance Across Domains

Empirical studies demonstrate substantial task-specific improvements from query guidance. (Wu et al., 2019) reports VQA test-standard accuracy of 68.37% for its single model—surpassing baselines by 2–3%, and ensembles reach 69.66%. Ablations show question-agnostic captions provide far less improvement than targeted ones (Up-Down only: 63.2%; + question-agnostic: 64.6%; + question-relevant: 65.8%; + COCO captions: 67.1%).

NarVid’s comprehensive query-guided caption integration increases video retrieval R@1 from 44.5% (CLIP4Clip) to 51.0%, with 1–4pt jumps per module (cross-modal, query filtering, dual-modal, hard negatives) (Hur et al., 7 Mar 2025).

QG-CoC yields up to +12% accuracy gain for multi-image tasks. Question-controlled scene-text captioning (GQAM, (Hu et al., 2021)) outperforms non-control and naïve baselines in CIDEr (+129–140 pts), SPICE (+17–18 pts), and answer recall (+10–17 pp.).

6. Mechanistic Analysis: Why Query Guidance Works

Query-guided captions provide multiple mechanistic benefits:

  • Additional Knowledge: Captions inject rich attributes, object relations, counts, and non-visual priors beyond pure region features, directly supporting answers to “Num” and “Other” categories (Wu et al., 2019).
  • Focused Information: Caption selection conditioned on the query suppresses irrelevant, noisy text, concentrating model attention on necessary evidence (Wu et al., 2019, Hur et al., 7 Mar 2025).
  • Modulated Attention: Caption embeddings steer visual attention modules toward regions mentioned in the caption, improving alignment with human annotations and interpretability (measured by Earth Mover's Distance; caption-attention reduces EMD by 0.08) (Wu et al., 2019).
  • Consistent Optimization: Gradient-based or classifier-based selection ensures loss signals do not conflict, yielding stable training and superior generalization (Wu et al., 2019).
  • Reasoning Structure: Multi-stage, prompt-based chains like QG-CoC ensure that fact extraction and aggregation are explicitly guided by the decomposed query, reducing the likelihood of missing edge cases or fine-grained heterogeneity (Kao et al., 5 Nov 2025).

7. Limitations and Prospects

Limitations include dependency on high-quality, query-relevant caption collections, computational cost for online gradient methods, and degradation for models with weak captioning ability (Kao et al., 5 Nov 2025). Query guidance requires the system to understand the semantic relationships encoded in the query—prompt misunderstanding or under-specified queries can lead to sub-optimal outputs. The approach is also sensitive to domain distribution; cross-domain generalization favors stylistic diversity in training corpora (Ng et al., 2020).

Ongoing research is extending query-guided captioning to video streams, sequential multi-image reasoning, and code-to-text paradigms (SQL2Text), employing graph-aware selection and iterative refinement to maximize informativeness and alignment with user intent (Al-Lawati et al., 6 Jan 2025). Benchmarks and datasets are available to evaluate new methods under diverse constraint regimes.


In sum, query-guided captioning is a convergent frontier for multimodal systems, leveraging question- or query-conditioned generation, targeted attention, and structured reasoning to produce captions optimized for downstream understanding and task-specific inference. Methodological innovation spans gradient-based supervision, prompt engineering, attention modulation, and multi-stage reasoning chains, with documented impact on VQA, data visualization, scene-text captioning, video retrieval, and code understanding (Wu et al., 2019, Hur et al., 7 Mar 2025, Kao et al., 5 Nov 2025, Hu et al., 2021, Liew et al., 2022, Al-Lawati et al., 6 Jan 2025).
