
M^3Searcher: Modular Multimodal Search Agent

Updated 16 January 2026
  • M^3Searcher is a modular, RL-trained agent for multimodal information retrieval that decouples evidence acquisition from answer synthesis.
  • It integrates text and image search tools within a broader MRAG framework to perform coherent multi-hop reasoning across diverse inputs.
  • Multi-objective reinforcement learning and a specialized dataset enable M^3Searcher to outperform baselines on complex multimodal tasks.

M^3Searcher (Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning) is a modular, reinforcement-learning-trained agent for multimodal information seeking that advances the state of the art in retrieval-augmented, reasoning-centric automation. Notably, it architecturally decouples evidence acquisition from answer synthesis, enabling robust, multi-step question answering over text and vision inputs, and it is optimized by a multi-objective reward targeting factual accuracy, reasoning soundness, and retrieval fidelity. M^3Searcher demonstrates strong adaptability and effectiveness on complex, multi-hop multimodal tasks, outperforming existing approaches in both in-domain and transfer settings (Yu et al., 14 Jan 2026).

1. System Architecture and Modular Workflow

M^3Searcher is instantiated within a broader Multimodal Retrieval-Augmented Generation (MRAG) framework and comprises two principal modules:

  • Planner: A small, efficient multimodal LLM (e.g., Qwen2.5-VL-7B) that interprets the multimodal user query $(v, q)$ and decomposes the task into internal reasoning (`<think> ... </think>` blocks), external tool invocations, and evidence-aggregation actions. The planner governs episodic search rollouts and decides the termination point.
  • Answer Generator: A large language or multimodal model (e.g., Qwen3-30B-A3B or DeepSeek-R1) that, given the entire evidence trajectory collected by the planner, generates the final user-facing answer.

Within the planner, three tool modules are accessible:

  1. Image Search Tool: Reverse image retrieval via the Serper API, which returns the top visually similar image and associated webpage titles/URLs.
  2. Text Search Tool: A Wikipedia-based retrieval and reranking pipeline utilizing E5 text embeddings, returning the top-10 semantically relevant document chunks.
  3. Expert Answer Generator: Invoked precisely once to produce the conclusive answer given the search and reasoning trace.

At each step $t$, the agent state is $(O_t, \alpha_t, C_t, I_t)$, where $O_t$ is the active observation, $\alpha_t$ is the planner's reasoning output, $C_t$ is a tool call (if generated), and $I_t$ is the received tool response. The computational trajectory is

$$\mathcal{T} = \{O_1, \alpha_1, C_1, I_1, \ldots, O_T, \alpha_T, C_T, I_T\}$$

with the constraint that $C_T$ is always the answer generator call (Yu et al., 14 Jan 2026).

2. Retrieval-Oriented Multi-Objective Reinforcement Learning

Agent training employs Group-Relative Policy Optimization (GRPO), a PPO-style objective with trajectory-level group normalization and a KL penalty to a reference model.
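The episodic rollout structure that these trajectories follow can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: `run_episode`, `toy_planner`, and the stubbed tool lambdas are hypothetical stand-ins for the planner LLM, the Serper image search, the E5 text retriever, and the answer generator.

```python
# Minimal sketch of one planner rollout. The state at step t is
# (O_t, alpha_t, C_t, I_t): observation, reasoning, tool call, tool response.
# All names and tool implementations here are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class Step:
    observation: str    # O_t: active observation
    reasoning: str      # alpha_t: planner's <think> output
    tool_call: str      # C_t: name of the invoked tool
    tool_response: str  # I_t: returned evidence

def run_episode(query, planner, tools, max_steps=8):
    """Roll out one search episode; the final call must be 'answer' (C_T)."""
    trajectory = []
    observation = query
    for t in range(max_steps):
        reasoning, tool_name, tool_args = planner(observation, trajectory)
        if t == max_steps - 1:
            tool_name = "answer"  # enforce the constraint C_T = answer call
        response = tools[tool_name](tool_args)
        trajectory.append(Step(observation, reasoning, tool_name, response))
        if tool_name == "answer":
            break
        observation = response  # the tool's evidence becomes O_{t+1}
    return trajectory

# Toy planner and tools, for illustration only.
def toy_planner(obs, traj):
    if not traj:
        return "Identify the landmark.", "image_search", obs
    if len(traj) == 1:
        return "Query its height.", "text_search", "Eiffel Tower height"
    return "Enough evidence collected.", "answer", traj

tools = {
    "image_search": lambda q: "Top match: Eiffel Tower (en.wikipedia.org)",
    "text_search": lambda q: "The Eiffel Tower is 300 metres tall.",
    "answer": lambda traj: "The Eiffel Tower is 300 meters tall.",
}

traj = run_episode("(image of a tower, 'How tall is it?')", toy_planner, tools)
```

The trajectory returned here is exactly the object GRPO scores: one reward per complete rollout, normalized within a sampled group.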
The overall reward is formulated as:

$$R = \alpha R_{\text{factual}} + \beta R_{\text{reasoning}} + \gamma R_{\text{retrieval}}$$

  • $R_{\text{factual}}$ (Answer Reward): A scalar in $[0, 1]$ yielded by an LLM judge comparing the generated answer $I_T$ against gold reference answers.
  • $R_{\text{reasoning}}$ (Format/Validity Reward):

$$R_{\text{reasoning}} = \begin{cases} 0, & \text{if all tool-call syntax/termination rules are correct} \\ -1, & \text{otherwise} \end{cases}$$

  • $R_{\text{retrieval}}$ (Information Retrieval Reward):

$$R_{\text{retrieval}} = R_{\text{TextRetrieval}} + R_{\text{ImgRetrieval}}$$

where $R_{\text{ImgRetrieval}} \in \{0, 0.25, 0.5\}$ is graded according to whether visual grounding is correct, cautious, or erroneous, and $R_{\text{TextRetrieval}} \in [0, 0.5]$ is proportional to the fraction of reasoning hops covered by correct retrieved evidence.

The policy objective is:

$$\mathcal{J}(\theta) = \mathbb{E}_{i,t}\left[\min\left(\rho^i_t A^i_t,\; \mathrm{clip}(\rho^i_t, 1-\epsilon, 1+\epsilon)\, A^i_t\right)\right] - \beta_{\mathrm{KL}}\, \mathrm{KL}(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}})$$

with:

$$\rho^i_t = \frac{\pi_\theta(y^i_t \mid s^i_t)}{\pi_{\mathrm{ref}}(y^i_t \mid s^i_t)}, \qquad A^i_t = \frac{R_i - \mathrm{mean}(\{R_i\})}{\mathrm{std}(\{R_i\})}$$

This reward configuration facilitates balanced exploration and exploitation of heterogeneous retrieval modalities and stringent adherence to reasoning protocols (Yu et al., 14 Jan 2026).

3. Multimodal Retrieval and Evidence Integration Mechanisms

Textual and visual retrievals are handled as modular, non-attentive operations:

  • Text Queries: Questions $q$ and evidence chunks $d$ are embedded by an E5 model and compared via cosine similarity:

$$\mathrm{sim}(q, d) = \frac{e_q \cdot e_d}{\|e_q\| \, \|e_d\|}$$

The top matches are injected as evidence.
  • Image Queries: The Serper API executes reverse-image retrieval, returning the most visually similar image and a cache of titles as textual context.
  • Integration Protocol: The planner's observation at each step includes only the relevant context. There is no end-to-end cross-modal attention within the planner; instead, visual contexts are re-encoded at each "think" step by the vision encoder.

A plausible implication is that this design sharply reduces resource requirements compared to approaches that concatenate or jointly encode all evidence modalities over many turns (Yu et al., 14 Jan 2026).

4. MMSearchVQA Dataset: Construction and Properties

MMSearchVQA is a multimodal multi-hop QA dataset constructed to facilitate retrieval-centric, RL-based training:

  • Source: ReasonVQA's Wikidata subgraph.
  • Construction Pipeline:
    1. BFS on the Wikidata graph to enumerate multi-hop chains.
    2. Filtering for unique answer paths and minimum hop length ($\geq 2$).
    3. Cross-validation with real retrieved Wikipedia passages for each hop; only chains with complete evidence are retained.
    4. Sentence-level evidence gathering.
    5. Difficulty stratification into three levels, with easy examples downsampled.

Statistics:

| #Examples | Modalities | Hop Count | Difficulty Split | Domain Coverage |
|-----------|--------------|---------------|-------------------------------------|-------------------------------------------------|
| 6,000 | Image + Text | 2–5 (μ ≈ 2.8) | ≈33% / 33% / 33% (easy/medium/hard) | Broad: geography, architecture, biography, etc. |

This dataset addresses previous gaps in multi-hop, multimodal search-trajectory supervision (Yu et al., 14 Jan 2026).

5. Training and Hyperparameter Regimen

  • Planner Backbone: Qwen2.5-VL-7B
  • Answer Generator (train-time): Qwen3-30B-A3B
  • RL Algorithm: GRPO with group size $G = 8$, clipping $\epsilon = 0.2$, KL weight $\beta_{\mathrm{KL}} = 0.01$
  • Optimizer: AdamW with learning rate $5 \times 10^{-6}$ and a batch size of 16 trajectories
  • Pretraining: Format-only reward for the initial 1,000 updates; the full reward schedule thereafter
  • Masking: Tool-response tokens are excluded from the RL objective so that only model-generated tokens influence policy updates

This regimen ensures that the planner learns to optimally coordinate tool calls and assemble evidence chains, while offloading answer generation to the larger model (Yu et al., 14 Jan 2026).

6. Evaluation Protocols and Empirical Results

M^3Searcher is evaluated using the following metrics:

  • Answer Accuracy (LLM judge)
  • Text Retrieval Score (fraction of hops correctly retrieved)
  • Image Retrieval Score (graded mean)
  • Multi-step Reasoning Success Rate (all hops and the final answer correct)

Benchmarks include both in-domain (MMSearchVQA test, Wikipedia search) and out-of-domain (InfoSeek, MMSearch, MRAG-Bench with Serper/Google search) settings.
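The composite reward from Section 2 and GRPO's group-relative advantage can be made concrete with a small numerical sketch. This is illustrative only: the coefficient values $\alpha = \beta = \gamma = 1$ are assumptions (the summary above does not state them numerically), and the four sample rollouts are invented.

```python
# Sketch of R = alpha*R_factual + beta*R_reasoning + gamma*R_retrieval
# and the group-normalized advantage A_i = (R_i - mean) / std.
# Coefficients and example rollouts are illustrative assumptions.

def total_reward(r_factual, syntax_ok, img_retrieval, text_retrieval,
                 alpha=1.0, beta=1.0, gamma=1.0):
    assert 0.0 <= r_factual <= 1.0            # LLM-judge answer score
    assert img_retrieval in (0.0, 0.25, 0.5)  # graded visual grounding
    assert 0.0 <= text_retrieval <= 0.5       # hop coverage, scaled to [0, 0.5]
    r_reasoning = 0.0 if syntax_ok else -1.0  # format/validity reward
    r_retrieval = text_retrieval + img_retrieval
    return alpha * r_factual + beta * r_reasoning + gamma * r_retrieval

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One group of G = 4 sampled trajectories (the paper's setup uses G = 8).
group = [
    total_reward(1.0, True, 0.5, 0.5),    # fully correct rollout  -> 2.0
    total_reward(0.5, True, 0.25, 0.25),  # partially grounded     -> 1.0
    total_reward(0.0, True, 0.0, 0.0),    # wrong answer           -> 0.0
    total_reward(0.0, False, 0.0, 0.0),   # malformed tool calls   -> -1.0
]
advantages = group_relative_advantages(group)
```

Because advantages are centered within each group, better-than-average rollouts are reinforced and worse-than-average ones suppressed without a learned value function.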
Main reported results (Table 1 of (Yu et al., 14 Jan 2026)):

| Benchmark | Model | Accuracy (%) | Notable Comparison |
|--------------------------|------------------------------|--------------|----------------------------------|
| MMSearchVQA (in-domain) | M^3Searcher (Qwen3-30B-A3B) | 54.75 | Best baseline (CogPlanner) 48.37 |
| MMSearch (out-of-domain) | M^3Searcher (Qwen3-30B-A3B) | 55.62 | Baseline (CogPlanner) 39.77 |
| MMSearch (out-of-domain) | M^3Searcher (DeepSeek-R1) | 63.30 | |

Ablation studies indicate a significant performance decrease upon removal of $R_{\text{retrieval}}$ or individual retrieval tools (drops of 4–6 percentage points), while ablating the answer generator collapses the system. RL-based training increases image-search usage from under 5% to about 25%, improving visual evidence coverage. Inclusion of the retrieval-oriented reward speeds convergence and induces longer evidence-collection rollouts (mean step count increases from 3.5 to 4.2) (Yu et al., 14 Jan 2026).

7. Key Mechanistic Insights and Representative Trajectories

M^3Searcher's superiority is explained by the following mechanistic properties:

  • Decoupled Reasoning and Acquisition: Answer synthesis operates on the entire collected evidence, leveraging large-model capacity, while the planner specializes in modular tool orchestration.
  • Reward-Shaped, Modality-Balanced Search: RL-driven fine-tuning ensures balanced use of text and image retrieval despite pretrained biases.
  • Robustness to Transfer and Data Scarcity: M^3Searcher maintains strong performance across domains, search engines, and answer generators, as shown by transfer results.

Representative multimodal episodes (abridged):

  • For "What is the height of the building shown?":
    1. <think> "The image resembles the Eiffel Tower." </think>
    2. ImageSearch yields Eiffel Tower candidates.
    3. <think> "I'll query the height." </think>
    4. TextSearch fetches a passage with "300 m."
    5. Final answer: "The Eiffel Tower is 300 meters tall."
  • For "Who designed the landmark in the picture?":
    1. <think> "Looks like the Leaning Tower of Pisa." </think>
    2. ImageSearch detects the candidate.
    3. <think> "Query its architect." </think>
    4. TextSearch returns "Bonanno Pisano."
    5. Answer: "The Leaning Tower of Pisa was designed by Bonanno Pisano."
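The TextSearch steps in these episodes rely on the cosine-similarity ranking from Section 3. A minimal sketch follows; the tiny 3-dimensional vectors stand in for real E5 embeddings (which have ~1024 dimensions), and the chunk texts are invented examples rather than actual Wikipedia passages.

```python
# Sketch of E5-style chunk ranking: sim(q, d) = (e_q . e_d) / (||e_q|| ||e_d||),
# with the top-k chunks injected as evidence (top-10 in the actual pipeline).
# Embeddings and passages below are illustrative stand-ins.

import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k_chunks(query_emb, chunks, k=10):
    """Rank (text, embedding) pairs by cosine similarity to the query."""
    scored = [(cosine_sim(query_emb, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

query_emb = [0.9, 0.1, 0.0]  # stand-in for the embedded question
chunks = [
    ("The Eiffel Tower is 300 metres tall.",    [0.8, 0.2, 0.1]),
    ("Paris is the capital of France.",         [0.1, 0.9, 0.2]),
    ("Gustave Eiffel's firm built the tower.",  [0.7, 0.3, 0.0]),
]
ranked = top_k_chunks(query_emb, chunks, k=2)
```

Because ranking is a pure similarity computation over precomputed embeddings, it stays cheap relative to feeding every candidate passage through the planner's context.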

This suggests that M^3Searcher's modular design and explicit reward shaping yield more coherent, trustworthy, and citation-backed output than approaches relying on a monolithic, end-to-end LLM context (Yu et al., 14 Jan 2026).

