Enabling Search Agents to Think While Waiting: Diffusion LLMs Deliver 15% Faster Inference Without Accuracy Loss

The paper introduces DLLM‑Searcher, which equips diffusion large language models with a two‑stage training pipeline and a P‑ReAct inference scheme, letting the model issue tool calls early and keep reasoning while it waits for results. This yields a 14–22% end‑to‑end speedup while matching or surpassing traditional autoregressive agents on multi‑hop QA benchmarks.


Problem: Serial reasoning in ReAct search agents

Traditional ReAct agents (e.g., Search‑R1, R1Searcher) follow a strictly serial loop: think → call tool → wait for result → think again. During the waiting period the model is idle, causing large end‑to‑end latency in multi‑hop QA tasks.
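A minimal sketch makes the idle time concrete. Here `llm_generate` and `search` are hypothetical stand‑ins for the model and the retrieval backend, and the `<search>`/`<result>` tags are illustrative rather than the paper's exact format:

```python
def llm_generate(context: str) -> str:
    """Hypothetical stand-in for one generation pass of the agent model."""
    ...

def search(query: str) -> str:
    """Hypothetical stand-in for the retrieval backend."""
    ...

def serial_react(question: str, max_turns: int = 4) -> str:
    """Serial ReAct loop: the model does nothing while `search` runs."""
    context = question
    for _ in range(max_turns):
        step = llm_generate(context)              # think, then maybe act
        if "<search>" not in step:
            return step                           # final answer, no more tools
        query = step.split("<search>")[1].split("</search>")[0]
        context += step + f"<result>{search(query)}</result>"  # blocking wait
    return llm_generate(context)
```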

Why diffusion LLMs can parallelize

Diffusion large language models (dLLMs) generate all token positions in parallel by iteratively denoising a masked sequence, which gives them two properties (illustrated in the decoding sketch after this list):

Free generation order: the model can decode the most important tokens first and fill in the rest later.

Pre‑thinking: bidirectional attention allows tool‑call tokens to be generated while the “thought” part remains masked.
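A minimal decoding sketch shows how a masked dLLM can realize both properties; `model(ids)` returning per‑position logits over the vocabulary is an assumption here, not any specific implementation:

```python
import torch

def diffusion_decode(model, ids, mask_id, steps=8):
    # Confidence-ordered parallel decoding sketch for a masked dLLM.
    # `ids` is a 1-D LongTensor with masked slots set to `mask_id`;
    # `model(ids)` is assumed to return (seq_len, vocab) logits.
    masked = ids == mask_id
    per_step = max(1, int(masked.sum()) // steps)
    while masked.any():
        logits = model(ids)                     # bidirectional: sees every position
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)  # only masked slots compete
        k = min(per_step, int(masked.sum()))
        top = conf.topk(k).indices              # most confident positions win,
        ids[top] = pred[top]                    # regardless of left-to-right order
        masked[top] = False
    return ids
```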

The paper notes that “diffusion models know the answer before decoding.” Yet naïvely dropping a dLLM into ReAct on HotpotQA (500 questions) yields a 0 % success rate: the model either emits an EOS token prematurely, forgets to call the tool, or produces malformed tool calls.
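Catching those failure modes amounts to a simple format check. A hypothetical validator is sketched below; the `<search>`/`<answer>` tag format is illustrative, not the paper's exact one:

```python
import re

# Matches a well-formed tool call; the tag format is an assumption.
TOOL_CALL = re.compile(r"<search>.+?</search>", re.DOTALL)

def classify_step(step: str) -> str:
    """Classify one generated step into the three observed failure modes."""
    if not step.strip() or step.strip() == "<eos>":
        return "premature_eos"
    if "<search>" not in step and "<answer>" not in step:
        return "missing_tool_call"
    if "<search>" in step and not TOOL_CALL.search(step):
        return "malformed_tool_call"
    return "ok"
```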

Two‑stage training to make dLLM an effective search agent

Stage 1 – Agentic SFT (Supervised Fine‑Tuning): A strong seed model (Doubao Seed‑1.8) generates standard search trajectories; after filtering out incorrect or malformed examples, 3 977 high‑quality trajectories remain for fine‑tuning the dLLM. The challenge is that trajectories contain both the model’s “thought” and the tool’s response; the model should learn the former but not memorize the latter.
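A plausible filtering pass, reusing the validator sketched earlier; field names like `final_answer`, `gold`, and `steps` are illustrative, not the paper's schema:

```python
def answers_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()  # crude exact match

def keep_trajectory(traj: dict) -> bool:
    """Keep only trajectories with a correct answer and well-formed steps."""
    if not answers_match(traj["final_answer"], traj["gold"]):
        return False                                     # wrong final answer
    return all(classify_step(s) == "ok" for s in traj["steps"])

# `raw_trajectories` is the seed model's output; ~3 977 examples survive.
sft_data = [t for t in raw_trajectories if keep_trajectory(t)]
```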

To prevent leaking future search results into the training context, the authors introduce Agentic Noising, which adds noise only to the “thought” and “tool‑call” segments while either preserving the tool response verbatim or masking it entirely. The accompanying Agentic ELBO loss is computed only on the noised positions, so gradients never flow through the response tokens.
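A minimal sketch of this loss, assuming per‑token segment labels and the standard 1/t masked‑diffusion reweighting; none of this is the paper's exact code:

```python
import torch
import torch.nn.functional as F

THOUGHT, TOOL_CALL, RESPONSE = 0, 1, 2   # illustrative segment labels

def agentic_elbo_loss(model, ids, segment, mask_id, t):
    # Agentic Noising sketch: `segment` tags each token of the trajectory,
    # `t` in (0, 1] is the sampled noise level, and `model` is assumed to be
    # a bidirectional denoiser returning (seq_len, vocab) logits.
    noisable = segment != RESPONSE                      # responses stay verbatim
    noise = (torch.rand(ids.shape) < t) & noisable      # mask a t-fraction
    corrupted = torch.where(noise, torch.full_like(ids, mask_id), ids)
    logits = model(corrupted)
    # Cross-entropy only on the noised thought/tool-call positions, with the
    # usual 1/t reweighting; tool-response tokens receive no gradient at all.
    return F.cross_entropy(logits[noise], ids[noise]) / t
```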

Stage 2 – Agentic VRPO (Variance‑Reduced Preference Optimization): The SFT‑trained model samples two rollouts per question, and the authors keep pairs where one answer is correct and the other is wrong. Preference learning on these pairs further separates correct from incorrect reasoning paths, improving accuracy by more than 3 % on all evaluated datasets.
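A hypothetical pairing pass consistent with that description; `rollout` and the field names are assumptions, and `answers_match` is the crude matcher from the earlier sketch:

```python
def build_preference_pairs(questions, rollout):
    """Pair a correct and an incorrect trajectory for each split outcome."""
    pairs = []
    for q in questions:
        a, b = rollout(q["text"]), rollout(q["text"])   # two sampled rollouts
        a_ok = answers_match(a["final_answer"], q["gold"])
        b_ok = answers_match(b["final_answer"], q["gold"])
        if a_ok != b_ok:                                # one right, one wrong
            chosen, rejected = (a, b) if a_ok else (b, a)
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs
```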

P‑ReAct: Parallel reasoning and acting without extra training

During inference the model needs a mechanism to prioritize tool calls. P‑ReAct achieves this by:

Pre‑filling boundary markers so that the model knows the exact region where a tool call should appear.

Adding a positive bias (α = 0.5) to the confidence scores of tokens in the tool‑call region, causing the decoder to select those tokens first.

Consequently, the model almost always completes the tool call before any other token, sends the request to the search engine, and then continues filling the masked “thought” region while waiting for the result.
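A sketch of the confidence biasing, assuming the decoder picks the highest‑scoring masked positions at each step (as in the decoding sketch earlier); `ALPHA` matches the reported α = 0.5, and the mask arguments are assumptions:

```python
ALPHA = 0.5  # tool-call bias from the paper's reported setting

def biased_confidence(probs, tool_call_span, masked):
    # `probs` is a (seq_len, vocab) tensor of per-position token probabilities;
    # `tool_call_span` and `masked` are boolean (seq_len,) tensors marking the
    # pre-filled tool-call region and the still-masked slots, respectively.
    conf = probs.max(-1).values                      # base per-position confidence
    conf = conf + ALPHA * tool_call_span.float()     # prioritize tool-call tokens
    return conf.masked_fill(~masked, float("-inf"))  # decoded slots sit out
```

Because the bias is applied purely at decode time, P‑ReAct adds no training cost on top of the two‑stage pipeline.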

Empirical results

On four multi‑hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, Bamboogle, MuSiQue), DLLM‑Searcher achieves an average accuracy of 57.0 % under rule‑based matching (ACC_R) and 56.6 % under LLM judging (ACC_L), surpassing traditional RAG methods and matching or slightly exceeding the autoregressive R1Searcher.

With P‑ReAct, end‑to‑end inference speed improves by 14.77 %–22.08 % with negligible accuracy loss. Forcing autoregressive models (Qwen‑3 series) to output tool calls before their thoughts causes a significant drop in accuracy, confirming that acting early while continuing to think is a structural advantage of diffusion models.

DLLM‑Searcher uses fewer than 8 000 training examples yet attains 68.8 % on the out‑of‑domain Bamboogle dataset, demonstrating strong generalization.

Implications

Targeted training enables diffusion LLMs to match autoregressive reasoning ability while exploiting parallel generation to keep thinking while waiting for external tool results, opening a new avenue for accelerating tool‑augmented agents.

Reference: “DLLM‑Searcher: Adapting Diffusion Large Language Models for Search Agents” (arXiv:2602.07035). Code repository: https://github.com/bubble65/DLLM-Searcher

Tags: agentic training, diffusion LLM, parallel reasoning, multi-hop QA, P-ReAct, search agents