OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Published 6 May 2026 in cs.CV | (2605.05185v1)

Abstract: Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper develops an open, end-to-end data and RL training recipe for building robust multimodal search agents.
It introduces a scalable data curation pipeline that synthesizes high-quality multi-hop VQA trajectories using both visual and textual cues.
OpenSearch-VL achieves significant accuracy improvements over prior models by leveraging tool-interleaved, agentic reasoning strategies.

OpenSearch-VL: An Open Recipe for Frontier Multimodal Deep Search Agents

Introduction and Motivation

OpenSearch-VL targets a key bottleneck in the development of high-performance multimodal search agents: the lack of transparent, reproducible, and high-quality data and training recipes for agentic multimodal reasoning systems. Existing state-of-the-art models for knowledge-intensive visual question answering (VQA) and web-based multimodal reasoning have largely relied on proprietary datasets, closed tool trajectories, and private codebases, limiting the community's ability to replicate, analyze, or extend these frontier systems. OpenSearch-VL proposes an open, end-to-end training recipe that systematically addresses these gaps by releasing datasets, agentic RL training protocols, and code, with demonstrated state-of-the-art performance on complex multimodal deep search tasks.

Data Curation Pipeline

The core of OpenSearch-VL is its scalable data curation pipeline, which synthesizes high-quality, multi-turn, multi-hop VQA trajectories tailored to tool-demanding reasoning over both visual and textual evidence. The pipeline proceeds through three key stages:

Multi-hop VQA Instance Construction: Starting from the Wikipedia hyperlink graph, the system samples constrained entity paths, generates canonical and fuzzified question–answer pairs by rewriting entity mentions to prevent shortcut retrieval, and grounds anchor entities with CLIP-filtered representative images. This design enforces a compositional reasoning requirement, explicitly suppressing single-hop shortcut solutions and encouraging multi-hop, tool-interleaved behavior.
Filtering and Visual Enhancement: The instance pool is refined through staged filtering to eliminate trivial, perceptual, or parametric-knowledge answerable samples. An enhancement subset is synthesized by degrading source images—via blur, distortion, and downsampling—and pairing them with image restoration tool-based solutions, incentivizing the agent to engage in “think-with-image” behavior under real-world visual imperfections.
Trajectory Synthesis: Expert multi-turn trajectories are generated by rolling out Claude Opus 4.6 against the constructed environment, invoking all tools as required. Trajectories are then passed through a two-stage rejection filtering: final-answer correctness (via GPT-4o) and a process-level judge (via GPT-5.4) scoring logical, tool-use, and non-repetitive behavior.
Figure 1: Overview of the data curation pipeline, illustrating the construction of high-quality multi-hop VQA data, staged filtering, and trajectory synthesis in a real tool environment.

Two primary datasets result: SearchVL-SFT-36k (for supervised fine-tuning) and SearchVL-RL-8k (for RL-based exploration).

Tool Environment

One distinctive strength of OpenSearch-VL is its unified, extensible tool suite, which supports agentic, iterative reasoning strategies that span both visual and textual input spaces. The tool environment is partitioned into three modality families:

Retrieval Tools: TextSearch (web search with summarization) and ImageSearch (reverse image search and entity recognition) for explicit knowledge acquisition.
Image Enhancement Tools: Sharpen, SuperResolution, and PerspectiveCorrect to address real-world image noise, blurring, low resolution, and perspective errors.
Attention & Parsing Tools: Crop (for spatial localization) and OCR (for layout-aware text extraction).

This environment empowers agents to go beyond pure retrieval. They can actively repair, manipulate, and parse images before attempting knowledge-based queries, addressing inherent limitations of prior retrieval-focused agentic architectures.

Training Algorithm: Fatal-Aware Multi-turn RL

OpenSearch-VL introduces a novel multi-turn Group Relative Policy Optimization (GRPO) algorithm, specifically adapted for long-horizon, multi-tool agentic interaction with a real multimodal environment. This RL protocol incorporates several vital innovations:

Composite Multi-turn Reward: The RL reward integrates final-task accuracy (via LLM judge), process-level search quality (GPT-5.4), and strict format compliance, providing dense, credit-assigned feedback for both successful and partially-successful rollouts.
Fatal-Aware Masking: To prevent noisy gradient signals from cascading execution failures (e.g., repeated tool errors), a fatal-aware token mask truncates the learning signal at the point of sustained failures, preserving only the valid prefix of reasoning and tool-use steps before fatality.
One-Sided Advantage Clamping: Unlike hard-masking/trajectory-drop baselines, one-sided clamping ensures that viable pre-failure prefixes can still yield positive gradient updates if their normalized reward exceeds the group mean, avoiding unnecessary suppression of productive partial behaviors.
Figure 2: Overview of the RL training pipeline, sampling trajectories, applying composite rewards, and using fatal-aware masking for stable multi-turn policy optimization.

Figure 3: Training dynamics over the RL phase. Fatal-aware GRPO sustains a higher number of turns and reaches higher accuracy compared to alternatives.

Figure 4: Fatal-aware masking with one-sided clamping illustrated on rollouts. Productive prefixes of fatal trajectories are positively reinforced when group-normalized rewards are high.

Figure 5: Aggregate distribution showing that most fatal rollouts are clamped to zero, while those with high partial rewards are preserved, mirroring successful non-fatal rollout behavior.

Numerical Results and Ablation

Comprehensive evaluation across seven challenging multimodal deep search benchmarks demonstrates robust, scalable improvements for OpenSearch-VL, including:

Qwen3-VL-30B-A3B model improves average accuracy from 47.8 to 61.6 (a +13.8 point margin over the prior strongest open source agent) and achieves individual gains of +13.3 (VDR), +24.5 (MMSearch), +10.2 (FVQA), and +16.2 (InfoSeek).
The largest model, OpenSearch-VL-32B, attains 63.7 average accuracy, exceeding both direct-reasoning baselines and competing with leading proprietary commercial systems.

Extensive ablation studies on the SFT data pipeline and RL recipe show:

Removal of core data pipeline stages (fuzzy rewriting, source-anchor decoupling, staged filtering) causes large performance drops (−8 to −14 points), demonstrating that both search chain compositionality and tool requirement are critical.
Replacing fatal-aware GRPO with vanilla RL or hard-masking suppresses the benefits of partially successful exploration, validating the importance of the proposed RL objective.

Agent Trajectory Case Study

An illustrative OpenSearch-VL trajectory demonstrates compositional, tool-interleaved reasoning answering “In what year did this bridge open?”:

The agent crops the image to focus on a road sign, uses image search to ground the bridge’s identity, and follows up with text search to extract the opening year from supporting webpages. This process chains visual attention, entity linking, and cross-modal retrieval, exemplifying the behavior incentivized by OpenSearch-VL’s reward and data curriculum.
Figure 6: OpenSearch-VL uses crop for visual focus, image search for entity linking, and text search for final answer verification in a realistic search scenario.

Theoretical and Practical Implications

OpenSearch-VL offers a comprehensive, open recipe for developing frontier multimodal deep search agents, providing both the data generators and RL training APIs needed for scalable, reproducible research. The ablation findings and benchmark performance underline that advanced multimodal search capability depends on i) staged, shortcut-minimizing, tool-demanding data; and ii) long-horizon RL algorithms robust to cascading failures and partial success.

Practically, OpenSearch-VL establishes a transparent reference for future studies into agentic reasoning, RL-driven tool use, and search-based retrieval under visually noisy and knowledge-intensive scenarios. Theoretically, it enables systematic analysis of how data, tool orchestration, and reward design interact to produce emergent agentic search behavior in large multimodal models.

Future Directions

Despite its strengths, OpenSearch-VL still contends with issues such as external tool reliability, judge-dependence for reward assignment, and challenges in exact experimental reproducibility owing to third-party API drift. Continuous work is warranted on fully open reward models, internalized perception-retrieval pipelines, robustness to tool malfunctions, and scaling (in both model and tool diversity). The open nature of the released data, code, and models is intended to catalyze broad community research on next-generation agentic multimodal systems.

Conclusion

OpenSearch-VL advances the standard for open development of multimodal search agents by delivering both substantive engineering (data generation, tool APIs, RL algorithms) and empirically-strong, carefully-validated results. Its release lowers the entry threshold for systematic research and benchmarking, powering progress in agentic multimodal intelligence for complex, visually-grounded retrieval and reasoning (2605.05185).

Markdown Report Issue