MSearcher: Modular Multimodal Search Agent
- MSearcher is a modular, RL-trained agent for multimodal information retrieval that decouples evidence acquisition from answer synthesis.
- It integrates text and image search tools within a broader MRAG framework to perform coherent multi-hop reasoning across diverse inputs.
- Multi-objective reinforcement learning and a specialized dataset enable MSearcher to outperform baselines on complex multimodal tasks.
MSearcher (Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning) is a modular, reinforcement learning-trained agent for multimodal information seeking that advances the state of the art in retrieval-augmented, reasoning-centric automation. Notably, it architecturally decouples evidence acquisition from answer synthesis, enabling robust, multi-step question answering over text and vision inputs, and is optimized by a multi-objective reward targeting factual accuracy, reasoning soundness, and retrieval fidelity. MSearcher demonstrates strong adaptability and effectiveness on complex, multi-hop multimodal tasks, outperforming existing approaches in both in-domain and transfer settings (Yu et al., 14 Jan 2026).
## 1. System Architecture and Modular Workflow
MSearcher is instantiated within a broader Multimodal Retrieval-Augmented Generation (MRAG) framework and comprises two principal modules:
- Planner: A small, efficient multimodal LLM (e.g., Qwen2.5-VL-7B) responsible for interpreting the multimodal user query and decomposing the task into internal reasoning (`<think>...</think>` blocks), external tool invocations, and evidence-aggregation actions. The planner governs episodic search rollouts and decides the termination point.

- Answer Generator: A large language or multimodal model (e.g., Qwen3-30B-A3B or DeepSeek-R1) which, given the entire evidence trajectory collected by the planner, generates the final user-facing answer.

Within the planner, three tool modules are accessible:

1. Image Search Tool: Reverse image retrieval via the Serper API, which returns the top visually similar image and associated webpage titles/URLs.

2. Text Search Tool: Wikipedia-based retrieval and reranking pipeline utilizing E5 text embeddings, returning the top-10 semantically relevant document chunks.

3. Expert Answer Generator: Invoked exactly once to produce the conclusive answer given the search and reasoning trace.

At each step $t$, the agent state is $s_t = (o_t, r_t, c_t, f_t)$, where $o_t$ is the active observation, $r_t$ is the planner's reasoning output, $c_t$ is a tool call (if generated), and $f_t$ is the received tool response. The computational trajectory is

$$\tau = (s_1, s_2, \ldots, s_T),$$

with the constraint that the final tool call $c_T$ is always the answer generator call (Yu et al., 14 Jan 2026).

## 2. Retrieval-Oriented Multi-Objective Reinforcement Learning

Agent training employs Group-Relative Policy Optimization (GRPO), a PPO-style objective with trajectory-level group normalization and a KL penalty to a reference model. The overall reward combines three terms,

$$R = R_{\mathrm{ans}} + R_{\mathrm{fmt}} + R_{\mathrm{IR}},$$

where:

- $R_{\mathrm{ans}}$ (Answer Reward): Scalar score yielded by an LLM judge comparing the generated answer against gold reference answers.

- $R_{\mathrm{fmt}}$ (Format/Validity Reward): Rewards structurally valid rollouts, i.e., well-formed `<think>` blocks and tool calls.

- $R_{\mathrm{IR}}$ (Information Retrieval Reward): Combines an image term and a text term:
  - $R_{\mathrm{img}}$: graded for correct, cautious, or erroneous visual grounding;
  - $R_{\mathrm{txt}}$: proportional to the fraction of reasoning hops covered by correct retrieved evidence.
The policy objective is

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

with importance ratio $\rho_i = \pi_\theta(\tau_i)/\pi_{\theta_{\mathrm{old}}}(\tau_i)$ and group-normalized advantage

$$\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}.$$

This reward configuration facilitates balanced exploration and exploitation of heterogeneous retrieval modalities and stringent adherence to reasoning protocols (Yu et al., 14 Jan 2026).

## 3. Multimodal Retrieval and Evidence Integration Mechanisms

Textual and visual retrievals are handled as modular, non-attentive operations:

- Text Queries: Questions and evidence chunks are embedded by an E5 model and scored by cosine similarity,

$$\mathrm{sim}(q, d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert\, \lVert \mathbf{e}_d \rVert},$$

and the top-10 matches are injected as evidence.

- Image Queries: The Serper API executes reverse-image retrieval, returning the most visually similar image and a cache of titles as textual context.

- Integration Protocol: The planner's observation at each step includes only the relevant context; there is no end-to-end cross-modal attention within the planner. Instead, visual contexts are re-encoded at each "think" step by the vision encoder.

A plausible implication is that this design sharply reduces resource requirements compared to approaches that concatenate or jointly encode all evidence modalities over many turns (Yu et al., 14 Jan 2026).

## 4. MMSearchVQA Dataset: Construction and Properties

MMSearchVQA is a multimodal multi-hop QA dataset constructed to facilitate retrieval-centric, RL-based training:

- Source: ReasonVQA's Wikidata subgraph.

- Construction Pipeline:
  1. BFS on the Wikidata graph to enumerate multi-hop chains.
  2. Filtering for unique answer paths and a minimum hop length.
  3. Cross-validation with real retrieved Wikipedia passages for each hop; only chains with complete evidence are retained.
  4. Sentence-level evidence gathering.
  5. Difficulty stratification: three levels, with easy examples downsampled.
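Pipeline step 1 above can be illustrated as a breadth-first enumeration of relation chains over a toy graph; the graph contents and function names below are hypothetical stand-ins for the actual Wikidata subgraph:

```python
from collections import deque

# Toy directed knowledge graph: entity -> list of (relation, entity) edges.
# Contents are illustrative, not from the MMSearchVQA source graph.
GRAPH = {
    "Paris":        [("capital_of", "France"), ("contains", "Eiffel Tower")],
    "Eiffel Tower": [("designed_by", "Gustave Eiffel")],
    "France":       [("continent", "Europe")],
}

def enumerate_chains(start, min_hops=2, max_hops=5):
    """BFS over the graph, yielding relation chains of min_hops..max_hops edges."""
    queue = deque([(start, [])])  # (current entity, path of (relation, entity) hops)
    while queue:
        node, path = queue.popleft()
        if min_hops <= len(path) <= max_hops:
            yield path
        if len(path) < max_hops:
            for rel, nxt in GRAPH.get(node, []):
                queue.append((nxt, path + [(rel, nxt)]))
```

The hop bounds mirror the dataset's 2–5 hop range; each yielded chain would then be filtered for answer uniqueness and validated against retrieved Wikipedia evidence.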
Statistics:

| #Examples | Modalities | Hop Count | Difficulty Split | Domain Coverage |
|-----------|--------------|---------------|-------------------------------------|-------------------------------------------------|
| 6,000 | Image + Text | 2–5 (μ ≈ 2.8) | ≈33% / 33% / 33% (easy/medium/hard) | Broad: geography, architecture, biography, etc. |

This dataset addresses previous gaps in multi-hop, multimodal search trajectory supervision (Yu et al., 14 Jan 2026).

## 5. Training and Hyperparameter Regimen

- Planner Backbone: Qwen2.5-VL-7B
- Answer Generator (train-time): Qwen3-30B-A3B
- RL Algorithm: GRPO with group size $G$, clipping parameter $\epsilon$, and KL weight $\beta$
- Optimizer: AdamW, batch size of 16 trajectories
- Pretraining: Format-only reward for the initial 1,000 updates; full reward schedule subsequently
- Masking: Tool-response tokens are excluded from the RL objective so that only model-generated tokens influence policy updates

This regimen ensures that the planner learns to coordinate tool calls and assemble evidence chains optimally, while offloading answer generation to the larger model (Yu et al., 14 Jan 2026).

## 6. Evaluation Protocols and Empirical Results

MSearcher is evaluated using the following metrics:

- Answer Accuracy (LLM-judge)
- Text Retrieval Score (fraction of hops correctly retrieved)
- Image Retrieval Score (graded mean)
- Multi-step Reasoning Success Rate (all hops and final answer correct)

Benchmarks include both in-domain (MMSearchVQA test, Wikipedia search) and out-of-domain (InfoSeek, MMSearch, MRAG-Bench with Serper/Google search) settings.
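A sketch of the two trajectory-level retrieval metrics, assuming a simplified per-hop boolean representation of retrieval success (the representation and function names are illustrative):

```python
# Illustrative metric sketches; real evaluation uses LLM-judge scoring and
# graded image rewards rather than booleans.
def text_retrieval_score(hops_retrieved_correctly):
    """Fraction of reasoning hops for which the correct evidence was retrieved."""
    if not hops_retrieved_correctly:
        return 0.0
    return sum(hops_retrieved_correctly) / len(hops_retrieved_correctly)

def multi_step_success(hops_retrieved_correctly, answer_correct):
    """Success only if every hop was retrieved correctly AND the answer is right."""
    return all(hops_retrieved_correctly) and answer_correct
```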
Main reported results (Table 1 of (Yu et al., 14 Jan 2026)):

| Benchmark | Model | Accuracy (%) | Notable Comparison |
|-----------|-------|--------------|--------------------|
| MMSearchVQA (in-domain) | MSearcher (Qwen3-30B-A3B) | 54.75 | Best baseline (CogPlanner): 48.37 |
| MMSearch (out-of-domain) | MSearcher (Qwen3-30B-A3B) | 55.62 | Baseline (CogPlanner): 39.77 |
| MMSearch (out-of-domain) | MSearcher (DeepSeek-R1) | 63.30 | |

Ablation studies indicate a significant performance decrease upon removal of the retrieval-oriented reward or of individual retrieval tools (4–6 pp drops), while ablating the answer generator collapses the system. RL-based training substantially increases image search usage, improving visual evidence coverage. Including the retrieval-oriented reward speeds convergence and induces longer evidence-collection rollouts (mean step count increases from 3.5 to 4.2) (Yu et al., 14 Jan 2026).

## 7. Key Mechanistic Insights and Representative Trajectories

MSearcher's superiority is explained by the following mechanistic properties:

- Decoupled Reasoning and Acquisition: Answer synthesis operates on the entire collected evidence, leveraging large-model capacity while the planner specializes in modular tool orchestration.

- Reward-Shaped, Modality-Balanced Search: RL-driven fine-tuning ensures balanced use of text and image retrieval despite pretrained biases.

- Robustness to Transfer and Data Scarcity: MSearcher maintains strong performance across domains, search engines, and answer generators, as shown by transfer results.

Representative multimodal episodes (abridged):

- For "What is the height of the building shown?":
  1. `<think>` "The image resembles the Eiffel Tower." `</think>`
  2. ImageSearch yields Eiffel Tower candidates.
  3. `<think>` "I'll query the height." `</think>`
  4. TextSearch fetches a passage with "300 m."
  5. Final answer: "The Eiffel Tower is 300 meters tall."
- For "Who designed the landmark in the picture?":
  1. `<think>` "Looks like the Leaning Tower of Pisa." `</think>`
  2. ImageSearch detects a candidate.
  3. `<think>` "Query its architect." `</think>`
  4. TextSearch returns "Bonanno Pisano."
  5. Answer: "The Leaning Tower of Pisa was designed by Bonanno Pisano."
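The episodes above follow a fixed think / tool-call / observe pattern. A minimal control-flow sketch with stubbed tools (the stub return values and function names are illustrative; the real system calls the Serper API, a Wikipedia/E5 pipeline, and LLM-driven planning):

```python
# Minimal planner-loop sketch mirroring the Eiffel Tower episode. All tools
# are stubs; in MSearcher the planner is a multimodal LLM choosing tool calls.
def image_search(image):
    return "Eiffel Tower"  # stub: reverse-image grounding result

def text_search(query):
    return "The Eiffel Tower is 300 m tall."  # stub: top retrieved passage

def answer_generator(question, evidence):
    return evidence[-1]  # stub: a large LLM would synthesize from the full trace

def run_episode(image, question):
    evidence = []
    entity = image_search(image)                   # steps 1-2: ground the image
    evidence.append(f"Image resembles {entity}")
    passage = text_search(f"{entity} {question}")  # steps 3-4: query the attribute
    evidence.append(passage)
    # Step 5: the expert answer generator is invoked exactly once on the trace.
    return answer_generator(question, evidence)
```

The key design point this mirrors is the single, final answer-generator call: the planner only collects evidence and never synthesizes the answer itself.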
This suggests that MSearcher’s modular design and explicit reward shaping yield more coherent, trustworthy, and citation-backed output compared to approaches relying on a monolithic, end-to-end LLM context (Yu et al., 14 Jan 2026).