How to Build Multi‑Step Reasoning Training Data for Deep Research Agents
Standard QA datasets fall short for deep research tasks because they lack the multi‑step, dynamic reasoning those tasks require. This article explains why, walks through four data‑construction techniques (SailorFog‑QA, WebFrontier, WebShaper, and E2HQA), and covers trajectory sampling, filtering, data scale, and how to explain each method in an interview.
Why Existing QA Datasets Are Insufficient
Public QA sets such as Natural Questions (NQ) and TriviaQA contain single‑hop or at most two‑hop questions; the answer can usually be extracted from a single paragraph without cross‑document reasoning. Deep research tasks, by contrast, require 10–20 reasoning hops, dynamic adjustment of search queries, and the ability to keep probing when information is incomplete. Models trained on single‑step QA data never learn the "search‑then‑refine" behavior these tasks demand.
Four Main Data‑Construction Methods
Method 1: SailorFog‑QA – Knowledge‑Graph Random Walk
Complex questions are mapped to paths in a knowledge graph. Pipeline:
Extract entities from Wikipedia or a domain‑specific KG and build a graph where nodes are entities and edges are relations.
Perform a random walk to generate a path; the path length determines the number of reasoning hops.
Use an LLM to convert the path into a natural‑language question, ensuring the answer is the terminal node.
Advantages: high answer verifiability because the answer is guaranteed to be the endpoint of a known graph path. SailorFog V2 adds “orbit‑node fuzzing” to make intermediate nodes less explicit.
import random

def generate_sailorfog_qa(graph, llm, min_hops=3, max_hops=6):
    """Generate a multi-hop QA pair from a knowledge-graph random walk.

    graph is assumed to be a networkx-style graph of entities and relations;
    llm is any client exposing a path_to_question(path, graph) helper.
    """
    start_node = random.choice(list(graph.nodes))
    path = [start_node]
    current = start_node
    target_hops = random.randint(min_hops, max_hops)
    for _ in range(target_hops):
        neighbors = list(graph.neighbors(current))
        if not neighbors:
            break  # dead end: the walk may terminate short of target_hops
        current = random.choice(neighbors)
        path.append(current)
    # The terminal node of the walk is the verifiable ground-truth answer.
    question = llm.path_to_question(path, graph)
    answer = path[-1]
    return {"question": question, "answer": answer, "hops": len(path) - 1}

Method 2: WebFrontier – Iterative Complexity Upgrade
Start from a small seed QA set and repeatedly apply four upgrade operations to increase difficulty:
Entity Replacement: swap answer entities with rarer variants.
Condition Augmentation: add temporal, geographic, or other constraints.
Comparison Merge: combine two independent questions into a comparative query.
Negation Inversion: rephrase a positive question into a negative form.
Each upgrade increments a complexity level, producing a graded curriculum from easy to hard. The method scales well because only a few seed questions are needed and the upgrade chain is fully traceable.
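A minimal sketch of the upgrade loop, assuming a hypothetical llm.apply_upgrade helper that performs one named operation on a question (comparison_merge would additionally need a second question drawn from the pool):

import random

# The four upgrade operations described above
UPGRADE_OPS = [
    "entity_replacement",      # swap answer entities for rarer variants
    "condition_augmentation",  # add temporal/geographic constraints
    "comparison_merge",        # merge two questions into a comparison
    "negation_inversion",      # flip a positive question into negative form
]

def upgrade_question(seed_qa: dict, target_level: int, llm) -> dict:
    """Raise a seed question's complexity by one level per pass.

    llm.apply_upgrade is a hypothetical helper that returns the rewritten
    question and (possibly changed) answer; the recorded upgrade_chain keeps
    the whole easy-to-hard derivation traceable.
    """
    qa = dict(seed_qa, level=0, upgrade_chain=[])
    for _ in range(target_level):
        op = random.choice(UPGRADE_OPS)
        qa["question"], qa["answer"] = llm.apply_upgrade(qa["question"], qa["answer"], op)
        qa["level"] += 1
        qa["upgrade_chain"].append(op)
    return qa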
Method 3: WebShaper – Formal Reasoning‑Chain Control
Define a formal “knowledge projection” language that explicitly describes each reasoning step, then generate questions that follow the specified chain. Example projection sequence:
Projection P1: from "Company X" to "CEO of X"
Projection P2: from "CEO of X" to "the school they graduated from"
Projection P3: from "the school" to "its current president"
The resulting question is: "Who is the current president of the school that the CEO of Company X graduated from?" This method provides complete control over the number and type of reasoning steps, enabling a uniformly distributed difficulty curriculum.
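Such a chain is easy to represent programmatically. A sketch follows; the projection syntax here is illustrative, not WebShaper's actual formal language, and llm.verbalize is an assumed helper:

from dataclasses import dataclass

@dataclass
class Projection:
    """One reasoning step: maps a source entity to a target via a relation."""
    source: str
    relation: str
    target: str

# The chain from the example above: Company X -> CEO -> alma mater -> president
chain = [
    Projection("Company X", "ceo_of", "CEO of X"),
    Projection("CEO of X", "graduated_from", "the school"),
    Projection("the school", "current_president", "its current president"),
]

def chain_to_question(chain: list[Projection], llm) -> str:
    """Verbalize a projection chain; hop count == len(chain), fully controlled."""
    steps = " | ".join(f"{p.source} -[{p.relation}]-> {p.target}" for p in chain)
    # llm.verbalize is a hypothetical helper that turns the formal chain into
    # a natural-language question whose answer is the final target.
    return llm.verbalize(steps)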
Method 4: E2HQA – Easy‑to‑Hard Entity Replacement
Begin with simple QA pairs and iteratively replace entities with more obscure references while keeping the answer unchanged. Example:
Original: “Which famous universities are in Beijing?” → Answer: “Peking University, Tsinghua University…”
First replacement: replace “Beijing” with “the city whose abbreviation matches the most concentrated IT hub in China”.
Second replacement: replace with “the city that launched the earliest domestic chip strategy”.
Pros: fast generation and low implementation cost. Cons: limited question variety and mechanically incremental difficulty.
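Despite those limits, the loop itself is short. A minimal sketch, assuming hypothetical llm.obscure_entity and llm.answers helpers; the key invariant is that the answer never changes:

def e2hqa_harden(qa: dict, rounds: int, llm) -> list[dict]:
    """Iteratively obscure entities while keeping the answer fixed.

    Returns the full easy-to-hard ladder, one QA pair per round.
    llm.obscure_entity and llm.answers are both assumed helpers.
    """
    ladder = [qa]
    current = qa
    for level in range(1, rounds + 1):
        harder = llm.obscure_entity(current["question"])  # indirect reference
        # Keep the replacement only if the original answer still resolves.
        if llm.answers(harder) == current["answer"]:
            current = {"question": harder, "answer": current["answer"], "level": level}
            ladder.append(current)
    return ladder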
Trajectory Sampling: From Questions to Full Reasoning Paths
Beyond QA pairs, models need complete reasoning trajectories (Thought → Action → Observation → Answer). The process:
Teacher Model Generation: Use a strong LLM (e.g., Claude, GPT‑4o, DeepSeek V3) to generate 4–8 candidate trajectories per question, recording every intermediate step (a generation sketch follows this list).
Three‑Stage Funnel Filtering:
Format Validation: Ensure each trajectory follows the required markup (e.g., <think>…</think>, <tool_call>…</tool_call>, <observation>…</observation>, <answer>…</answer>). Non‑conforming samples are discarded.
Correctness Validation: The final answer must match the ground truth; otherwise the trajectory is excluded.
Quality Assessment: Evaluate step count (within a reasonable range), the necessity of each tool call, and the clarity of the reasoning chain. Only trajectories passing all three checks are kept.
Typically 30–50 % of raw trajectories survive this pipeline.
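A minimal sketch of the generation step, assuming a hypothetical teacher.run_agent client that returns a full Thought → Action → Observation trace per rollout:

def sample_trajectories(question: dict, teacher, n_candidates: int = 6) -> list[dict]:
    """Roll out the teacher model several times per question.

    teacher.run_agent is a hypothetical client; temperature > 0 makes the
    4-8 rollouts diverge so the funnel has distinct candidates to filter.
    """
    candidates = []
    for _ in range(n_candidates):
        trace = teacher.run_agent(question["question"], temperature=0.8)
        candidates.append({
            "steps": trace.steps,            # every Thought/Action/Observation
            "final_answer": trace.answer,
            "ground_truth": question["answer"],
        })
    return candidates

The three-stage funnel itself can then be implemented as below.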
def filter_trajectories(trajectories: list[dict]) -> list[dict]:
    """Three-stage funnel filtering for reasoning trajectories.

    validate_format, verify_answer, and evaluate_quality are the project's
    own helpers, assumed to exist with these signatures.
    """
    # Stage 1: format validation (required markup tags present and closed)
    format_valid = [t for t in trajectories if validate_format(t)]
    print(f"Format pass: {len(format_valid)}/{len(trajectories)}")

    # Stage 2: correctness validation (final answer matches ground truth)
    correct = [t for t in format_valid
               if verify_answer(t["final_answer"], t["ground_truth"])]
    print(f"Answer correctness: {len(correct)}/{len(format_valid)}")

    # Stage 3: quality assessment (step count, no redundant calls, clear reasoning)
    high_quality = []
    for traj in correct:
        score = evaluate_quality(traj)
        if score["step_count_ok"] and score["no_repeat"] and score["reasoning_clear"]:
            high_quality.append(traj)
    print(f"Quality pass: {len(high_quality)}/{len(correct)}")
    return high_quality

Data Scale and Distribution
For the cold‑start SFT phase, 1,000–3,000 high‑quality trajectories are sufficient; quality outweighs quantity. A practical split is 1,200 SFT trajectories (≈38 % retention after filtering) and 3,000 RL‑generated trajectories.
Complexity distribution should be balanced: 40 % 3–5 hops, 40 % 6–10 hops, 20 % >10 hops. Task‑type diversity is also crucial—include factual verification, comparative analysis, comprehensive reporting, and time‑constrained queries, each roughly equally represented.
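As a sketch, the target mix can be encoded as sampling weights and enforced when drawing the SFT set (bucket names are illustrative; boundaries follow the percentages above):

import random
from collections import Counter

# Target complexity mix from this section: 40% / 40% / 20%
HOP_BUCKETS = {"3-5_hops": 0.4, "6-10_hops": 0.4, "10+_hops": 0.2}

def hop_bucket(hops: int) -> str:
    if hops <= 5:
        return "3-5_hops"
    return "6-10_hops" if hops <= 10 else "10+_hops"

def sample_sft_set(pool: list[dict], size: int = 1200) -> list[dict]:
    """Draw an SFT set that matches the target hop distribution."""
    selected = []
    for bucket, weight in HOP_BUCKETS.items():
        candidates = [t for t in pool if hop_bucket(t["hops"]) == bucket]
        quota = int(size * weight)
        selected += random.sample(candidates, min(quota, len(candidates)))
    print(Counter(hop_bucket(t["hops"]) for t in selected))
    return selected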
Using Negative Samples Wisely
Trajectories filtered out for being incorrect or inefficient can serve as negative examples in reinforcement‑learning (RL) training, helping the model learn to avoid bad reasoning patterns. However, trajectories that fail format validation (e.g., missing a closing <answer> tag) must be discarded entirely, as they teach the model to truncate outputs prematurely.
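A minimal sketch of that partition, reusing the validate_format and verify_answer helpers assumed earlier:

def partition_rejects(trajectories: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split rejected trajectories into RL negatives vs. hard discards.

    Format-broken samples (e.g., an unclosed <answer> tag) are dropped
    outright; well-formed but wrong ones become RL negative examples.
    """
    negatives, discards = [], []
    for t in trajectories:
        if not validate_format(t):
            discards.append(t)  # would teach the model premature truncation
        elif not verify_answer(t["final_answer"], t["ground_truth"]):
            negatives.append(t)  # usable as a negative example in RL
    return negatives, discards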
