
Diffusion-Augmented Interactive T2I Retrieval

Updated 4 February 2026
  • The paper presents a zero-shot retrieval pipeline that fuses LLM-based text reformulation with diffusion-generated visual proxies to improve Hits@10.
  • It employs multi-view contrastive learning to mitigate generative hallucination by aligning text and visual cues through robust training objectives.
  • The framework dynamically adjusts cross-modal fusion weights across dialogue turns, outperforming traditional fine-tuned multimodal models.

Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is an emerging paradigm for interactive multi-turn cross-modal retrieval, in which the system synthesizes generative visual proxies (via text-to-image diffusion models) at each dialogue turn and fuses these with LLM-guided text representations to drive image ranking. DAI-TIR circumvents the need for task-specific multimodal encoder fine-tuning by leveraging large pretrained language and diffusion models, delivering zero-shot, highly generalizable performance in dynamic, multi-round retrieval scenarios. Recent advances address the challenge of generative hallucination—synthetic proxies may diverge from user intent—through robust contrastive training objectives that semantically filter out inconsistent cues, substantially improving alignment, retrieval accuracy, and generalization across domains (Long et al., 26 Jan 2025, Zhang et al., 28 Jan 2026).

1. Problem Formulation and Motivations

In Interactive Text-to-Image Retrieval (I-TIR), the goal is to retrieve a relevant image from a database $\mathcal{I} = \{I_j\}_{j=1}^N$ given a dialogue context $C_n = \{D_0, Q_1, A_1, \ldots, Q_n, A_n\}$, where $D_0$ is the initial user description and $(Q_i, A_i)$ are system/user interactions. Traditional I-TIR approaches rely on fine-tuned multimodal encoders (e.g., BLIP2, BEiT-3), which impose significant computational costs and reduce robustness to distribution shift by narrowing the models' pretrained knowledge (Long et al., 26 Jan 2025).

DAI-TIR introduces an alternative that exploits two pretrained generative components per turn: an LLM-based reformulator $\mathcal{R}_1$ that condenses $C_t$ into an encoder-friendly query $S_t$, and a diffusion model $\mathcal{G}$ that synthesizes $K$ proxy images $\{\hat I_{t,k}\}$ from diverse prompts $P_{t,k}$ derived via a second LLM pipeline, capturing complementary visual facets of user intent. These representations are embedded, fused, and used for cross-modal image ranking without updating any encoder weights.

2. DAI-TIR: Algorithmic Pipeline

The DAI-TIR framework formalizes retrieval as follows (Long et al., 26 Jan 2025, Zhang et al., 28 Jan 2026):

  1. Dialogue Reformulation: Compute $S_t = \mathcal{R}_1(C_t)$.
  2. Prompt Diversification: For $k = 1, \ldots, K$, generate $P_{t,k} = \mathcal{R}_2(S_t, k)$.
  3. Generative Synthesis: Obtain visual proxies $\{\hat I_{t,k}\}$ via $\hat I_{t,k} = \mathcal{G}(P_{t,k})$.
  4. Representation Encoding:
    • Textual: $\mathbf{t}_t = E_t(S_t)$
    • Visual (proxies): $\mathbf{i}_{t,k} = E_v(\hat I_{t,k})$
    • Visual (candidates): $\mathbf{i}_j = E_v(I_j)$
  5. Cross-Modal Fusion: Fuse the representations:

$$F_t = \alpha\,\mathbf{t}_t + \beta \sum_{k=1}^{K} \mathbf{i}_{t,k}, \qquad \alpha + \beta = 1$$

with $\alpha/\beta$ controlling the text/image weighting per turn.

  6. Ranking: Score candidates via

$$s(I_j, F_t) = \frac{\langle \mathbf{i}_j, F_t \rangle}{\|\mathbf{i}_j\|\,\|F_t\|}$$

and rank $\mathcal{I}$ accordingly.

This pipeline is zero-shot: no training or loss-specific optimization is used. All components leverage pretrained models.
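
The fusion and ranking steps above reduce to a few lines of linear algebra. The sketch below (NumPy, with random arrays standing in for the $E_t$/$E_v$ outputs; the function name `fuse_and_rank` is a hypothetical helper, not the authors' code) illustrates steps 5–6 under the stated constraint $\alpha + \beta = 1$:

```python
import numpy as np

def fuse_and_rank(text_emb, proxy_embs, candidate_embs, alpha=0.7):
    """Fuse one text embedding with K diffusion-proxy embeddings (step 5)
    and rank candidate images by cosine similarity (step 6)."""
    beta = 1.0 - alpha                                   # enforce alpha + beta = 1
    fused = alpha * text_emb + beta * proxy_embs.sum(axis=0)   # F_t
    fused = fused / np.linalg.norm(fused)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = cands @ fused                               # cosine similarity s(I_j, F_t)
    order = np.argsort(-scores)                          # best-first ranking of the database
    return order, scores
```

Per the fusion schedule reported in Section 3, one would call this with `alpha=0.7` in early turns and `alpha=0.5` in later ones.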

3. Empirical Performance and Ablation Analysis

Extensive evaluation on VisDial, ChatGPT_BLIP2, HUMAN_BLIP2, and FLAN-Alpaca-XXL_BLIP2 benchmarks (2,064 dialogues each, 10 turns per dialogue) with Hits@10 as the metric demonstrates key findings (Long et al., 26 Jan 2025):

  • On FLAN-Alpaca-XXL_BLIP2 after 10 rounds, zero-shot DAR achieves a +7.61% absolute gain in Hits@10 over the zero-shot BLIP baseline.
  • Across diverse benchmarks, zero-shot DAI-TIR consistently improves Hits@10 by 4–6% relative to prior state-of-the-art finetuned models (e.g., ChatIR), and even surpasses them on the hardest, distributionally shifted settings (up to +4.22%).
  • With $K = 1$ proxy, DAI-TIR already outperforms fine-tuned ChatIR by +6.43%; increasing $K$ to $3$ yields up to +7.61%, but gains plateau beyond that.
  • Cross-modal fusion weighting evolves through the dialogue: early turns use $(\alpha, \beta) = (0.7, 0.3)$ (text-biased), later turns $(0.5, 0.5)$ (equal emphasis).

Efficiency is high: inference per turn requires only ~0.5 s for LLM reformulation and ~5 s for diffusion generation on commodity hardware. Because no fine-tuning is performed, training cost is zero, and generalization remains strong since the distribution narrowing induced by fine-tuning is avoided.

4. Role and Challenge of Diffusion Proxies

A key innovation in DAI-TIR is leveraging visual proxies from diffusion models as “generative views” of user intent. However, generative proxies derived from underspecified prompts often contain hallucinated content—attributes, objects, colors, or spatial relations that are not textually specified, filled in by the diffusion prior (Zhang et al., 28 Jan 2026).

Empirical analysis using chain-of-thought vision-language judges (Qwen3-VL, Gemma3) shows that roughly 40% of generated proxies contain some visual inconsistency. Such hallucination can misalign the proxy embedding and degrade performance; in some settings, a baseline diffusion-augmented BEiT-3 even underperforms its zero-shot text-only counterpart in early rounds because of noisy synthetic cues.

Table: Types of Hallucination in Diffusion Proxies

| Category | Example Error Type |
|---|---|
| Attribute mismatch | Color, shape, or count error |
| Extra/spurious object | Object not present in the query |
| Spatial/action error | Misplaced object or wrong action |

These phenomena highlight the necessity of hallucination-robust architectures for DAI-TIR.

5. Diffusion-aware Multi-view Contrastive Learning (DMCL)

To address diffusion-induced hallucination, Diffusion-aware Multi-view Contrastive Learning (DMCL) was introduced as a robust training framework (Zhang et al., 28 Jan 2026). DMCL treats textual, diffusion-derived, fused, and target-image embeddings as “multiple views” tied to the same underlying user intent, and applies a combination of:

  • Diffusion-aware contrastive loss ($\mathcal{L}_{\mathrm{diff}}$): multi-view symmetric InfoNCE alignment of the text, diffusion, and fused representations to the true image.
  • Hard-negative mining ($\mathcal{L}_{\mathrm{HNM}}$): explicitly pushes apart the top-$K$ most confusable negatives per view.
  • Text–diffusion semantic consistency ($\mathcal{L}_{\mathrm{sem}}$): combines feature-level InfoNCE alignment with distribution-level (Jensen–Shannon) agreement, penalizing divergence between the text-based and diffusion-based retrieval distributions.

The total loss is:

$$\mathcal{L} = \lambda_{\mathrm{diff}}\,\mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{sem}}\,\mathcal{L}_{\mathrm{sem}}$$

These pressures force the encoding backbone (BEiT-3) to filter out hallucinated cues, resulting in a representation space focused on the stable semantic core shared across modalities, while mapping irrelevant generative noise into a null subspace.
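
The multi-view alignment idea can be made concrete with a toy NumPy sketch. This is a simplified illustration, not the paper's implementation: it uses a one-directional InfoNCE (rather than the symmetric form), stands in for $\mathcal{L}_{\mathrm{sem}}$ with a feature-level text–diffusion InfoNCE term, and omits hard-negative mining and the Jensen–Shannon component; the function names are hypothetical.

```python
import numpy as np

def info_nce(queries, keys, tau=0.07):
    """InfoNCE: query i must identify key i among all keys in the batch,
    using temperature-scaled cosine-similarity logits."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / tau
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))

def dmcl_loss(text_v, diff_v, fused_v, image_v, lam_diff=1.0, lam_sem=0.5):
    """Simplified DMCL-style objective: align every query view with the
    true image (L_diff) and keep text/diffusion views consistent (L_sem)."""
    l_diff = (info_nce(text_v, image_v)
              + info_nce(diff_v, image_v)
              + info_nce(fused_v, image_v)) / 3.0
    l_sem = info_nce(text_v, diff_v)                 # feature-level consistency only
    return lam_diff * l_diff + lam_sem * l_sem
```

Views that agree on the shared semantic core drive this loss toward zero, while hallucinated content in any single view raises it, which is the pressure described above.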

6. Model Architecture, Training, and Embedding Analysis

  • The backbone remains the pretrained BEiT-3 base; representation heads $\Phi_T$, $\Phi_D$, $\Phi_I$, and $\Phi_F$ (small MLPs) project each modality into a $d$-dimensional, $\ell_2$-normalized embedding.
  • Fusion is an element-wise sum or a concatenate-then-linear projection, followed by normalization.
  • Training uses the DA-VisDial set: ~1M samples, each a 3-turn dialogue with diffusion proxies generated by Stable Diffusion 3.5.
  • Optimization uses AdamW with label smoothing, temperature scaling, and a hard-negative margin.
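
A minimal sketch of the projection-head and fusion shapes described above, assuming a Linear–ReLU–Linear head and element-wise-sum fusion (weight shapes and function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def projection_head(x, w1, b1, w2, b2):
    """Small MLP head (Linear -> ReLU -> Linear) projecting backbone
    features to a d-dimensional, l2-normalised embedding."""
    h = np.maximum(x @ w1 + b1, 0.0)
    z = h @ w2 + b2
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def fuse_views(text_z, diff_z):
    """Element-wise-sum fusion followed by renormalisation."""
    f = text_z + diff_z
    return f / np.linalg.norm(f, axis=-1, keepdims=True)
```

Keeping every view on the unit sphere is what makes the cosine-based contrastive losses of Section 5 well behaved.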

Analyses demonstrate sharper, higher-mean cosine similarity distributions for positive pairs post-DMCL, indicating noise suppression. Attention heatmaps reveal that DMCL enhances focus on intent-relevant regions, discarding hallucinated backgrounds or attributes.

7. Experimental Findings, Limitations, and Future Directions

DMCL achieves strong gains over prior DAI-TIR and text-only retrieval baselines on multiple datasets (Zhang et al., 28 Jan 2026):

  • On VisDial, cumulative Hits@10 rises from 75.3% (ChatIR_DAR) to 82.7% after 10 turns (+7.37%).
  • Comparable improvements (3.78%–6.49%) are observed on ChatGPT_BLIP2, HUMAN_BLIP2, Flan-Alpaca-XXL_BLIP2, and PlugIR_dataset.

Ablation studies indicate that most of the improvement derives from multi-view query-target alignment ($\mathcal{L}_{\mathrm{diff}}$), with the semantic consistency term ($\mathcal{L}_{\mathrm{sem}}$) providing further stabilization.

Limitations remain: fusion is a simple additive scheme; more sophisticated (e.g., cross-attention) or adaptive proxy selection may further enhance performance. A plausible implication is that future research on dynamic proxy filtering, hierarchical proxy modeling, or joint diffusion-retrieval training may further suppress hallucination and refine intent alignment.
