MSearcher: Modular Multimodal Search Agent
- MSearcher is a modular, RL-trained agent for multimodal information retrieval that decouples evidence acquisition from answer synthesis.
- It integrates text and image search tools within a broader MRAG framework to perform coherent multi-hop reasoning across diverse inputs.
- Multi-objective reinforcement learning and a specialized dataset enable MSearcher to outperform baselines on complex multimodal tasks.
MSearcher (Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning) is a modular, reinforcement learning-trained agent for multimodal information seeking that advances the state of the art in retrieval-augmented, reasoning-centric automation. Notably, it architecturally decouples evidence acquisition from answer synthesis, enabling robust, multi-step question answering over text and vision inputs, and is optimized by a multi-objective reward targeting factual accuracy, reasoning soundness, and retrieval fidelity. MSearcher demonstrates strong adaptability and effectiveness on complex, multi-hop multimodal tasks, outperforming existing approaches in both in-domain and transfer settings (Yu et al., 14 Jan 2026).
## 1. System Architecture and Modular Workflow
MSearcher is instantiated within a broader Multimodal Retrieval-Augmented Generation (MRAG) framework and comprises two principal modules:
- Planner: A small, efficient multimodal LLM (e.g., Qwen2.5-VL-7B) responsible for interpreting the multimodal user query and decomposing the task into internal reasoning (`<think>...</think>` blocks), external tool invocations, and evidence-aggregation actions. The planner governs episodic search rollouts and decides the termination point.

- Answer Generator: A large language or multimodal model (e.g., Qwen3-30B-A3B or DeepSeek-R1) which, given the entire evidence trajectory collected by the planner, generates the final user-facing answer.

Within the planner, three tool modules are accessible:

1. Image Search Tool: Reverse image retrieval via the Serper API, which returns the top visually similar image and associated webpage titles/URLs.

2. Text Search Tool: Wikipedia-based retrieval and reranking pipeline utilizing E5 text embeddings, returning the top-10 semantically relevant document chunks.

3. Expert Answer Generator: Invoked exactly once to produce the conclusive answer given the search and reasoning trace.

At each step $t$, the agent state is $s_t = (o_t, r_t, c_t, f_t)$, where $o_t$ is the active observation, $r_t$ is the planner's reasoning output, $c_t$ is a tool call (if generated), and $f_t$ is the received tool response. The computational trajectory is

$$\tau = (s_1, s_2, \ldots, s_T),$$

with the constraint that the final tool call $c_T$ is always the answer generator call (Yu et al., 14 Jan 2026).

## 2. Retrieval-Oriented Multi-Objective Reinforcement Learning

Agent training employs Group-Relative Policy Optimization (GRPO), a PPO-style objective with trajectory-level group normalization and a KL penalty to a reference model. The overall reward combines three terms,

$$R = R_{\mathrm{ans}} + R_{\mathrm{fmt}} + R_{\mathrm{IR}},$$

where:

- $R_{\mathrm{ans}}$ (Answer Reward): Scalar score yielded by an LLM judge comparing the generated answer against gold reference answers.

- $R_{\mathrm{fmt}}$ (Format/Validity Reward): Rewards structurally valid rollouts, i.e., well-formed `<think>` blocks and tool calls.

- $R_{\mathrm{IR}}$ (Information Retrieval Reward): Combines an image term and a text term:
  - $R_{\mathrm{img}}$: graded for correct, cautious, or erroneous visual grounding;
  - $R_{\mathrm{txt}}$: proportional to the fraction of reasoning hops covered by correct retrieved evidence.
The policy objective is

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

with importance ratio $\rho_i = \pi_\theta(\tau_i)/\pi_{\theta_{\mathrm{old}}}(\tau_i)$ and group-normalized advantage

$$\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}.$$

This reward configuration facilitates balanced exploration and exploitation of heterogeneous retrieval modalities and stringent adherence to reasoning protocols (Yu et al., 14 Jan 2026).

## 3. Multimodal Retrieval and Evidence Integration Mechanisms

Textual and visual retrievals are handled as modular, non-attentive operations:

- Text Queries: Questions and evidence chunks are embedded by an E5 model and scored by cosine similarity,

$$\mathrm{sim}(q, d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert\, \lVert \mathbf{e}_d \rVert},$$

and the top-10 matches are injected as evidence.

- Image Queries: The Serper API executes reverse-image retrieval, returning the most visually similar image and a cache of titles as textual context.

- Integration Protocol: The planner's observation at each step includes only the relevant context; there is no end-to-end cross-modal attention within the planner. Instead, visual contexts are re-encoded at each "think" step by the vision encoder.

A plausible implication is that this design sharply reduces resource requirements compared to approaches that concatenate or jointly encode all evidence modalities over many turns (Yu et al., 14 Jan 2026).

## 4. MMSearchVQA Dataset: Construction and Properties

MMSearchVQA is a multimodal multi-hop QA dataset constructed to facilitate retrieval-centric, RL-based training:

- Source: ReasonVQA's Wikidata subgraph.

- Construction Pipeline:
  1. BFS on the Wikidata graph to enumerate multi-hop chains.
  2. Filtering for unique answer paths and a minimum hop length.
  3. Cross-validation with real retrieved Wikipedia passages for each hop; only chains with complete evidence are retained.
  4. Sentence-level evidence gathering.
  5. Difficulty stratification: three levels, with easy examples downsampled.
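Pipeline step 1 above can be illustrated as a breadth-first enumeration of relation chains over a toy graph; the graph contents and function names below are hypothetical stand-ins for the actual Wikidata subgraph:

```python
from collections import deque

# Toy directed knowledge graph: entity -> list of (relation, entity) edges.
# Contents are illustrative, not from the MMSearchVQA source graph.
GRAPH = {
    "Paris":        [("capital_of", "France"), ("contains", "Eiffel Tower")],
    "Eiffel Tower": [("designed_by", "Gustave Eiffel")],
    "France":       [("continent", "Europe")],
}

def enumerate_chains(start, min_hops=2, max_hops=5):
    """BFS over the graph, yielding relation chains of min_hops..max_hops edges."""
    queue = deque([(start, [])])  # (current entity, path of (relation, entity) hops)
    while queue:
        node, path = queue.popleft()
        if min_hops <= len(path) <= max_hops:
            yield path
        if len(path) < max_hops:
            for rel, nxt in GRAPH.get(node, []):
                queue.append((nxt, path + [(rel, nxt)]))
```

The hop bounds mirror the dataset's 2–5 hop range; each yielded chain would then be filtered for answer uniqueness and validated against retrieved Wikipedia evidence.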
Statistics:

| #Examples | Modalities | Hop Count | Difficulty Split | Domain Coverage |
|-----------|--------------|---------------|-------------------------------------|-------------------------------------------------|
| 6,000 | Image + Text | 2–5 (μ ≈ 2.8) | ≈33% / 33% / 33% (easy/medium/hard) | Broad: geography, architecture, biography, etc. |

This dataset addresses previous gaps in multi-hop, multimodal search trajectory supervision (Yu et al., 14 Jan 2026).

## 5. Training and Hyperparameter Regimen

- Planner Backbone: Qwen2.5-VL-7B
- Answer Generator (train-time): Qwen3-30B-A3B
- RL Algorithm: GRPO with group size $G$, clipping parameter $\epsilon$, and KL weight $\beta$
- Optimizer: AdamW, batch size of 16 trajectories
- Pretraining: Format-only reward for the initial 1,000 updates; full reward schedule subsequently
- Masking: Tool-response tokens are excluded from the RL objective so that only model-generated tokens influence policy updates

This regimen ensures that the planner learns to coordinate tool calls and assemble evidence chains optimally, while offloading answer generation to the larger model (Yu et al., 14 Jan 2026).

## 6. Evaluation Protocols and Empirical Results

MSearcher is evaluated using the following metrics:

- Answer Accuracy (LLM-judge)
- Text Retrieval Score (fraction of hops correctly retrieved)
- Image Retrieval Score (graded mean)
- Multi-step Reasoning Success Rate (all hops and final answer correct)

Benchmarks include both in-domain (MMSearchVQA test, Wikipedia search) and out-of-domain (InfoSeek, MMSearch, MRAG-Bench with Serper/Google search) settings.
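A sketch of the two trajectory-level retrieval metrics, assuming a simplified per-hop boolean representation of retrieval success (the representation and function names are illustrative):

```python
# Illustrative metric sketches; real evaluation uses LLM-judge scoring and
# graded image rewards rather than booleans.
def text_retrieval_score(hops_retrieved_correctly):
    """Fraction of reasoning hops for which the correct evidence was retrieved."""
    if not hops_retrieved_correctly:
        return 0.0
    return sum(hops_retrieved_correctly) / len(hops_retrieved_correctly)

def multi_step_success(hops_retrieved_correctly, answer_correct):
    """Success only if every hop was retrieved correctly AND the answer is right."""
    return all(hops_retrieved_correctly) and answer_correct
```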
Main reported results (Table 1 of (Yu et al., 14 Jan 2026)):

| Benchmark | Model | Accuracy (%) | Notable Comparison |
|-----------|-------|--------------|--------------------|
| MMSearchVQA (in-domain) | MSearcher (Qwen3-30B-A3B) | 54.75 | Best baseline (CogPlanner): 48.37 |
| MMSearch (out-of-domain) | MSearcher (Qwen3-30B-A3B) | 55.62 | Baseline (CogPlanner): 39.77 |
| MMSearch (out-of-domain) | MSearcher (DeepSeek-R1) | 63.30 | |

Ablation studies indicate a significant performance decrease upon removal of the retrieval-oriented reward or of individual retrieval tools (4–6 pp drops), while ablating the answer generator collapses the system. RL-based training substantially increases image search usage, improving visual evidence coverage. Including the retrieval-oriented reward speeds convergence and induces longer evidence-collection rollouts (mean step count increases from 3.5 to 4.2) (Yu et al., 14 Jan 2026).

## 7. Key Mechanistic Insights and Representative Trajectories

MSearcher's superiority is explained by the following mechanistic properties:

- Decoupled Reasoning and Acquisition: Answer synthesis operates on the entire collected evidence, leveraging large-model capacity while the planner specializes in modular tool orchestration.

- Reward-Shaped, Modality-Balanced Search: RL-driven fine-tuning ensures balanced use of text and image retrieval despite pretrained biases.

- Robustness to Transfer and Data Scarcity: MSearcher maintains strong performance across domains, search engines, and answer generators, as shown by transfer results.

Representative multimodal episodes (abridged):

- For "What is the height of the building shown?":
  1. `<think>` "The image resembles the Eiffel Tower." `</think>`
  2. ImageSearch yields Eiffel Tower candidates.
  3. `<think>` "I'll query the height." `</think>`
  4. TextSearch fetches a passage with "300 m."
  5. Final answer: "The Eiffel Tower is 300 meters tall."
- For "Who designed the landmark in the picture?":
  1. `<think>` "Looks like the Leaning Tower of Pisa." `</think>`
  2. ImageSearch detects a candidate.
  3. `<think>` "Query its architect." `</think>`
  4. TextSearch returns "Bonanno Pisano."
  5. Answer: "The Leaning Tower of Pisa was designed by Bonanno Pisano."
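The episodes above follow a fixed think / tool-call / observe pattern. A minimal control-flow sketch with stubbed tools (the stub return values and function names are illustrative; the real system calls the Serper API, a Wikipedia/E5 pipeline, and LLM-driven planning):

```python
# Minimal planner-loop sketch mirroring the Eiffel Tower episode. All tools
# are stubs; in MSearcher the planner is a multimodal LLM choosing tool calls.
def image_search(image):
    return "Eiffel Tower"  # stub: reverse-image grounding result

def text_search(query):
    return "The Eiffel Tower is 300 m tall."  # stub: top retrieved passage

def answer_generator(question, evidence):
    return evidence[-1]  # stub: a large LLM would synthesize from the full trace

def run_episode(image, question):
    evidence = []
    entity = image_search(image)                   # steps 1-2: ground the image
    evidence.append(f"Image resembles {entity}")
    passage = text_search(f"{entity} {question}")  # steps 3-4: query the attribute
    evidence.append(passage)
    # Step 5: the expert answer generator is invoked exactly once on the trace.
    return answer_generator(question, evidence)
```

The key design point this mirrors is the single, final answer-generator call: the planner only collects evidence and never synthesizes the answer itself.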
This suggests that MSearcher’s modular design and explicit reward shaping yield more coherent, trustworthy, and citation-backed output compared to approaches relying on a monolithic, end-to-end LLM context (Yu et al., 14 Jan 2026).