DLLM-Searcher: Diffusion Model for Efficient Search
- DLLM-Searcher is an optimization framework that enhances diffusion LLMs with agentic post-training and blockwise inference to enable efficient search tasks.
- It integrates a two-stage process—Agentic SFT and VRPO—to improve multi-step reasoning and tool_call accuracy while addressing latency in ReAct-style search agents.
- Empirical results show DLLM-Searcher matches or exceeds state-of-the-art search agents in accuracy while reducing wall-clock latency by approximately 15%.
DLLM-Searcher is an optimization framework enabling Diffusion LLMs (dLLMs) to serve as efficient, high-accuracy backbone models for search agents operating under the ReAct (Reasoning+Acting) agent paradigm. It addresses both the inherent agent-ability deficiency of raw dLLM backbones and the end-to-end latency limitations of serial multi-turn agent operation. Its design combines a diffusion-based training regime specialized for agentic tasks, diffusion-specific preference optimization, and a novel block-level, tool-prioritized inference protocol termed Parallel-Reasoning and Acting (P-ReAct). Empirically, DLLM-Searcher matches or exceeds the accuracy of competitive autoregressive search agents while reducing wall-clock latency by approximately 15% (Zhao et al., 3 Feb 2026).
1. Diffusion LLMs and the Search-Agent Context
Diffusion LLMs (dLLMs) differ from autoregressive models (ARMs) in their blockwise, order-free decoding: given a source input $x$ and target sequence $y_0$, a discrete diffusion forward process randomly masks target tokens, and the denoising model learns to iteratively predict and reconstruct the masked positions in parallel. The canonical training objective is a negative ELBO:

$$\mathcal{L}_{\text{ELBO}} = -\,\mathbb{E}_{t \sim U(0,1],\; y_t}\!\left[\frac{1}{t}\sum_{i=1}^{|y_0|} \mathbf{1}\big[y_t^i = \texttt{[MASK]}\big]\, \log p_\theta\big(y_0^i \,\big|\, x, y_t\big)\right],$$

where each position of $y_0$ is masked independently with probability $t$ to form $y_t$. This paradigm intrinsically enables efficiency gains via non-causal, parallel token filling.
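One Monte-Carlo sample of this objective can be sketched in a few lines of pure Python (a minimal illustration, not the paper's implementation; the token list, mask symbol, and placeholder log-probabilities are assumptions):

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def forward_mask(tokens, t):
    """Discrete diffusion forward process: mask each position
    independently with probability t."""
    return [MASK if random.random() < t else tok for tok in tokens]

def neg_elbo_sample(true_logprobs, noisy, t):
    """One Monte-Carlo term of the negative ELBO: sum the denoiser's
    log-probabilities of the true tokens at masked positions only,
    weighted by 1/t."""
    masked = [tok == MASK for tok in noisy]
    return -sum(lp for lp, m in zip(true_logprobs, masked) if m) / t

random.seed(0)
tokens = ["the", "answer", "is", "42"]
t = 0.5
noisy = forward_mask(tokens, t)
# true_logprobs would come from the denoising model; placeholders here
loss = neg_elbo_sample([-0.1] * len(tokens), noisy, t)
```

In training, the expectation is taken over both the masking ratio $t$ and the random mask, with all masked positions predicted in parallel by one forward pass.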
Search agents structured in the ReAct framework interleave “thought” (planning) and “tool_call” (external API invocation) regions. Traditionally, the serial nature of generating a plan, issuing a tool call, awaiting the response, and resuming reasoning induces high end-to-end latency, especially as tool API wait times are non-negligible.
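The serial dependency can be made concrete with a toy turn loop (illustrative only; `search_tool` and the generator interface are assumptions, with a sleep standing in for API wait time):

```python
import time

def search_tool(query):
    """Stand-in for an external search API; the sleep models network wait."""
    time.sleep(0.01)
    return f"results for {query!r}"

def react_turn(generate, history):
    """One serial ReAct turn: the model emits a thought and a tool call,
    then sits idle until the tool response arrives before it can resume."""
    thought, tool_call = generate(history)       # model inference
    observation = search_tool(tool_call)         # model idles during this wait
    return history + [thought, tool_call, observation]

history = react_turn(lambda h: ("plan next hop", "capital of France"), [])
```

Each turn's latency is the sum of inference time and API wait time; over multi-hop trajectories these waits accumulate.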
2. Agent Ability and Latency Challenges
Two principal challenges prevent naïve adoption of dLLMs in search-agent settings:
- Agent Ability Challenge: Compared to strong ARMs, vanilla dLLMs are weak at generating format-compliant “tool_call” commands and at multi-step logical reasoning that chains multiple tool invocations. This stems from insufficient supervised grounding in the required agentic behaviors.
- Latency Challenge: The strictly sequential nature of the canonical ReAct protocol means the model idles while waiting for tool responses, failing to exploit the order-insensitive, parallel generation capability of dLLMs (Zhao et al., 3 Feb 2026).
3. Two-Stage Agentic Post-Training
DLLM-Searcher applies two specialized diffusion-based post-training stages to overcome agentic weakness:
- Agentic Supervised Fine-Tuning (Agentic SFT):
- Supervised data are constructed from queries (e.g., HotpotQA, 2Wiki, Musique datasets) with ReAct-style trajectories generated by a strong ARM teacher (Doubao-Seed-1.8).
- Only Think and Tool_Call blocks are masked using the diffusion forward process; Tool_Response blocks are either fully masked or kept intact.
- The reconstruction (Agentic ELBO) loss is restricted to masked Think and Tool_Call spans:
$$\mathcal{L}_{\text{Agentic}} = -\,\mathbb{E}_{t,\; y_t}\!\left[\frac{1}{t}\sum_{i \in \mathcal{S}} \mathbf{1}\big[y_t^i = \texttt{[MASK]}\big]\, \log p_\theta\big(y_0^i \,\big|\, x, y_t\big)\right],$$
where $\mathcal{S}$ denotes the positions belonging to Think and Tool_Call blocks.
- The resulting model more reliably learns the structural constraints and reasoning logic required of agentic operation.
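The span-restricted forward process can be sketched as follows (the block tags and token lists are illustrative assumptions; the alternative of masking Tool_Response blocks fully is not shown):

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def agentic_forward_mask(blocks, t):
    """Span-restricted forward process for Agentic SFT: only Think and
    Tool_Call blocks are noised with ratio t; Tool_Response blocks are
    kept intact and never become reconstruction targets."""
    noisy = []
    for kind, tokens in blocks:
        if kind in ("think", "tool_call"):
            noisy.append((kind, [MASK if random.random() < t else x
                                 for x in tokens]))
        else:  # tool_response: left intact
            noisy.append((kind, list(tokens)))
    return noisy

blocks = [("think", ["find", "capital"]),
          ("tool_call", ["search(", "'France'", ")"]),
          ("tool_response", ["Paris"])]
noisy = agentic_forward_mask(blocks, t=1.0)  # t=1.0: mask every eligible token
```

Focusing the loss on Think and Tool_Call spans concentrates gradient signal on exactly the regions the agent must generate at inference time.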
- Agentic Variance-Reduced Preference Optimization (Agentic VRPO):
- Preference data consists of paired rollouts from the SFT model under the P-ReAct sampling strategy, with pairs labeled by correct (winner) versus incorrect (loser) final answer, both adhering to format constraints.
- The diffusion advantage (ELBO advantage) is computed for each trajectory $y$ relative to a fixed reference model:
$$A_\theta(y) = \mathrm{ELBO}_\theta(y \mid x) - \mathrm{ELBO}_{\text{ref}}(y \mid x).$$
- A pairwise logistic loss over winner/loser rollouts $(y_w, y_l)$ is minimized:
$$\mathcal{L}_{\text{VRPO}} = -\,\mathbb{E}_{(y_w, y_l)}\big[\log \sigma\big(\beta\,(A_\theta(y_w) - A_\theta(y_l))\big)\big],$$
further reinforcing multi-step reasoning fidelity and tool_call accuracy.
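Numerically, the preference objective reduces to a logistic loss on the advantage gap, sketched below (the scalar ELBO inputs and the scaling hyperparameter `beta` are assumptions; in practice the ELBOs are Monte-Carlo estimates):

```python
import math

def elbo_advantage(elbo_policy, elbo_ref):
    """ELBO advantage of a trajectory: policy ELBO minus that of the
    frozen reference model."""
    return elbo_policy - elbo_ref

def vrpo_pair_loss(adv_winner, adv_loser, beta=1.0):
    """Pairwise logistic loss: -log sigmoid(beta * (A_w - A_l)),
    which pushes the winner's advantage above the loser's."""
    margin = beta * (adv_winner - adv_loser)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

a_w = elbo_advantage(-10.0, -12.0)   # winner trajectory: advantage +2.0
a_l = elbo_advantage(-11.0, -11.0)   # loser trajectory:  advantage  0.0
loss = vrpo_pair_loss(a_w, a_l)
```

The loss falls as the winner's ELBO advantage grows relative to the loser's, so gradient descent raises the likelihood of correct trajectories under the policy.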
4. Parallel-Reasoning and Acting (P-ReAct) Inference Protocol
To address sequential agent-processing bottlenecks, DLLM-Searcher introduces P-ReAct, a blockwise decoding and execution protocol leveraging order-invariance and position control in diffusion models:
- Special-Token Pre-filling: Positions reserved for tool_call JSON are pre-filled with corresponding special tokens, enforcing region boundaries.
- Confidence Biasing: At every denoising step, the confidence scores of masked tool_call positions receive a fixed positive bias, so tool_call regions are filled before Think regions with high reliability.
- Overlapping Reasoning and API Calls: Once the tool_call is completed, the API request is fired immediately, overlapping external response wait times with continuing Think-region denoising. The agent thus “keeps thinking while waiting,” a pattern infeasible in ARMs due to their left-to-right decoding constraint (Zhao et al., 3 Feb 2026).
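The overlap pattern can be sketched with a thread pool (a schematic of the scheduling idea only; the denoiser callbacks and `search_tool` are assumptions, not the paper's implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search_tool(query):
    """Stand-in external search API; the sleep models network latency."""
    time.sleep(0.02)
    return f"results for {query!r}"

def p_react_turn(denoise_tool_call, denoise_think):
    """P-ReAct sketch: tool_call positions decode first (confidence
    biasing), the API call fires immediately, and Think-region
    denoising continues while the response is in flight."""
    tool_call = denoise_tool_call()                    # biased to finish first
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(search_tool, tool_call)  # fire-and-continue
        thought = denoise_think()                      # keep thinking while waiting
        observation = pending.result()                 # join when both are done
    return thought, observation

thought, observation = p_react_turn(lambda: "capital of France",
                                    lambda: "verify against atlas")
```

The turn's wall-clock cost approaches max(think time, API wait) instead of their sum, which is the source of the reported latency savings.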
5. Experimental Results
Empirical evaluation is conducted on HotpotQA, 2Wiki, Bamboogle, and Musique multi-hop QA datasets, focusing on two metrics:
- ACC_R: Exact-match retrieval accuracy—the produced output contains the ground-truth answer span.
- ACC_L: LLM-as-judge accuracy (judged using Doubao-Seed-1.8).
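A minimal sketch of the ACC_R check, assuming a whitespace/case-normalized substring match (the exact normalization rule is an assumption):

```python
def acc_r(output, gold):
    """Exact-match retrieval: 1.0 if the ground-truth answer span
    appears in the produced output after lowercasing and collapsing
    whitespace, else 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(gold) in norm(output) else 0.0
```

ACC_L instead delegates the correctness judgment to an LLM judge, which tolerates paraphrases that a substring check would miss.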
| Model | HotpotQA ACC_R | 2Wiki ACC_R | Bamboogle ACC_R | Musique ACC_R |
|---|---|---|---|---|
| Search-R1 | 49.6 | 46.0 | 47.2 | 28.0 |
| R1Searcher* | 58.0 | 59.6 | 66.4 | 28.2 |
| SDAR (vanilla) | — | — | — | — |
| DLLM-Searcher | 60.4 | 69.8 | 68.8 | 29.0 |
DLLM-Searcher achieves parity or modest improvement over non-diffusion agents on three of four datasets and matches state-of-the-art LLM agent accuracy (Zhao et al., 3 Feb 2026).
P-ReAct yields an approximately 15% mean reduction in response latency without accuracy degradation, with per-dataset reductions of 14.8–22.1%. Attempts to re-order ARMs for tool-first decoding degrade accuracy by 3–7 points, underscoring the unique suitability of dLLMs for such decoding schedules.
6. Ablations, Limitations, and Future Directions
Ablation studies demonstrate that Agentic SFT alone improves vanilla SDAR from 0% to ~57–66% ACC_R, and Agentic VRPO delivers an additional 3–5 point gain. P-ReAct operates at ~78–85% of baseline latency relative to standard ReAct, with no loss in ACC_R. Crucially, the tool-first blockwise decoding—infeasible for ARMs—actually improves or maintains accuracy for dLLMs.
Noted limitations include the labor- and resource-intensive nature of curating high-quality agentic teacher trajectories, restriction to web-search tool use only, and evaluation under block sizes of 128. Extensions may include support for heterogeneous APIs (calculators, vision), adaptive block scheduling for long contexts, and experimentation with advanced RL objectives for dLLMs (Zhao et al., 3 Feb 2026).
7. Comparative Perspective and Broader Implications
DLLM-Searcher provides the first demonstration that diffusion models, when augmented with targeted agentic training and inference procedures, can fully realize their theoretical generation speed advantage in real-world agentic search settings, matching or exceeding autoregressive backbones in accuracy while providing significant latency reductions. The framework’s modularity—comprising diffusion-specific loss targeting, preference optimization, and controlled blockwise inference—offers a generalizable template for extending dLLM-based agents to more complex, multi-tool, or multi-modal scenarios. This suggests diffusion architectures, when equipped with the requisite “agentic” adaptations, are viable for low-latency, high-stakes reasoning-agent applications in research and production (Zhao et al., 3 Feb 2026).