
SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models

Published 18 Sep 2025 in cs.IR | arXiv:2509.15432v1

Abstract: Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision-language model first produces a detailed textual description of each document image, which is then embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialised multi-vector visual document encoder. It also scales better to large collections and offers broader multilingual coverage. Analysis shows that modern vision-language models capture complex textual and visual cues with sufficient granularity to act as a reusable semantic proxy. By offloading modality alignment to pretrained vision-language models, our approach removes the need for computationally intensive text-image contrastive training and establishes a strong zero-shot baseline for future VDR systems.

Summary

  • The paper introduces a novel zero-shot method for visual document retrieval by generating detailed textual descriptions with vision-language models and embedding them with a text encoder.
  • It achieves an nDCG@5 of 63.4% on the ViDoRe-v2 benchmark, outperforming state-of-the-art supervised multi-vector models.
  • The approach supports multilingual and cross-domain retrieval with flexible offline processing to reduce online latency and computational costs.

SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models

The paper "SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and LLMs" addresses the challenge of Visual Document Retrieval (VDR) by introducing a zero-shot method that utilizes large Vision-LLMs (VLMs) to generate textual descriptions from document images, followed by embedding these descriptions using a text encoder. This method outperforms traditional approaches that rely on specialized bi-encoders trained specifically for this task.

Zero-Shot VDR Approach

Figure 1: Zero-shot VDR using VLM-generated document descriptions and a pretrained text encoder.

The proposed approach employs a two-step process. First, a VLM generates detailed textual descriptions of document images. These descriptions capture both textual content and visual elements, effectively bridging the modality gap between text and images. Second, a pretrained text encoder embeds these descriptions, facilitating efficient retrieval by mapping queries and documents into a shared semantic space. This method obviates the need for expensive contrastive training on large-scale datasets, instead leveraging the intrinsic capabilities of modern VLMs to understand and describe visual content in natural language. The decoupled architecture allows for flexibility in combining the best available VLMs and text encoders, optimizing performance without task-specific training data.
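
This decoupled pipeline is straightforward to sketch. The snippet below is a minimal illustration, not the paper's implementation: BLIP and MiniLM stand in for the much larger VLMs and text encoders the paper evaluates, and the helper names are hypothetical.

```python
# Minimal sketch of the generate-and-encode pipeline. BLIP and MiniLM are
# illustrative stand-ins; the paper pairs far larger VLMs and text encoders.
from PIL import Image
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

describer = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def describe(image_path: str) -> str:
    """Step 1: the VLM turns a document image into a textual description."""
    return describer(Image.open(image_path))[0]["generated_text"]

def build_doc_embeddings(image_paths):
    """Offline: describe and embed every document image once."""
    descriptions = [describe(p) for p in image_paths]
    return encoder.encode(descriptions, normalize_embeddings=True)

def search(query: str, doc_embeddings, top_k: int = 5):
    """Online: embed the query and rank documents by cosine similarity."""
    q = encoder.encode([query], normalize_embeddings=True)
    return util.cos_sim(q, doc_embeddings)[0].topk(top_k)
```

Because the two stages communicate only through plain text, either component can be swapped for a stronger VLM or encoder without retraining anything.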

Performance Evaluation

Figure 2: nDCG@5 for zero-shot Visual Document Retrieval using VLMs and text encoders of varying scales. Despite no task-specific training, the zero-shot method competes with end-to-end models explicitly trained for VDR on large-scale text-(document image) datasets.

The zero-shot method achieves an nDCG@5 of 63.4% on the ViDoRe-v2 benchmark, surpassing the best supervised multi-vector models. This demonstrates that high-quality VLM-generated descriptions are sufficient for effective VDR. The findings also show that scaling the text encoder yields larger gains than scaling the VLM, with large VLMs benefiting most from pairing with strong text encoders.
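
For context, nDCG@5 measures how well the top five retrieved documents are ordered relative to an ideal ranking. A common graded-relevance form (the standard definition, not specific to this paper) is:

```latex
% Discounted cumulative gain over the top 5 results, normalised by the
% DCG of the ideal ordering (IDCG@5); rel_i is the graded relevance at rank i.
\mathrm{nDCG@5} = \frac{\mathrm{DCG@5}}{\mathrm{IDCG@5}},
\qquad
\mathrm{DCG@5} = \sum_{i=1}^{5} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
```

A score of 63.4% thus means the method's top-5 rankings realize 63.4% of the gain an ideal ordering would achieve, averaged over queries.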

Implementation and Computational Considerations

The generate-and-encode step runs entirely offline, so it does not affect online retrieval latency. Preprocessing can therefore proceed without time constraints, accommodating the heavy compute of large VLMs. At deployment, pruning and quantization can accelerate inference, which is crucial for controlling costs in large-scale applications. Smaller VLMs, such as InternVL3-2B, offer an attractive speed-accuracy trade-off, achieving competitive results with far lower processing demands.
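
The offline/online split can be made concrete with a vector index. The sketch below assumes the normalized embeddings from the earlier pipeline sketch; FAISS is one common choice, not something the paper prescribes.

```python
# Illustrative offline/online split: all VLM work happens at indexing time,
# so online cost is one text-encoder forward pass plus a vector lookup.
import faiss
import numpy as np

def build_index(doc_embeddings: np.ndarray) -> faiss.Index:
    """Offline: store L2-normalized embeddings; inner product = cosine."""
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])
    index.add(doc_embeddings.astype(np.float32))
    return index

def query_index(index: faiss.Index, query_embedding: np.ndarray, top_k: int = 5):
    """Online: a single search whose latency is independent of VLM size."""
    scores, ids = index.search(query_embedding.astype(np.float32).reshape(1, -1), top_k)
    return list(zip(ids[0], scores[0]))
```

For very large collections, a quantized index such as faiss.IndexIVFPQ trades a small amount of accuracy for substantial memory and speed savings, complementing the pruning and quantization of the models themselves.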

Multilingual and Cross-Domain Capabilities

The method performs robustly across languages and domains, outperforming traditional VDR models particularly on multilingual and cross-lingual retrieval tasks. This is a significant advantage for applications that require broad coverage of diverse document types and languages, and it points to scalability for large collections such as those in enterprise and global information systems.

Conclusion

The research demonstrates the efficacy of leveraging modern VLMs for zero-shot visual document retrieval. By decoupling description generation from encoding, the method capitalizes on advances in vision-language integration, bypassing the need for task-specific training data and thereby lowering the barrier to adoption. The work establishes a strong baseline and encourages further exploration of zero-shot solutions for multimodal information retrieval across languages and domains. The authors suggest that future work focus on closing the residual performance gaps between supervised and zero-shot systems, particularly for non-English and low-resource languages, where improvements remain promising.
