How to Build Multi‑Step Reasoning Training Data for Deep Research Agents

Standard QA datasets fall short for deep research tasks because they lack the required multi‑step, dynamic reasoning. This article explains why, introduces four data‑construction techniques (SailorFog‑QA, WebFrontier, WebShaper, E2HQA), and covers trajectory sampling, filtering, data‑scale considerations, and interview‑ready explanations.

Wu Shixiong's Large Model Academy

Why Existing QA Datasets Are Insufficient

Public QA sets such as Natural Questions (NQ) and TriviaQA are single‑hop or at most two‑hop tasks; the answer can be extracted from a single paragraph without cross‑document reasoning. Deep Research requires 10–20 hops of reasoning, dynamic search adjustments, and the ability to continue probing when information is incomplete. Models trained on single‑step QA data cannot learn the “search‑then‑refine” behavior needed for deep research.

Four Main Data‑Construction Methods

Method 1: SailorFog‑QA – Knowledge‑Graph Random Walk

Complex questions are mapped to paths in a knowledge graph. Pipeline:

Extract entities from Wikipedia or a domain‑specific KG and build a graph where nodes are entities and edges are relations.

Perform a random walk to generate a path; the path length determines the number of reasoning hops.

Use an LLM to convert the path into a natural‑language question, ensuring the answer is the terminal node.

Advantages: high answer verifiability because the answer is guaranteed to be the endpoint of a known graph path. SailorFog V2 adds “orbit‑node fuzzing” to make intermediate nodes less explicit.

import random

def generate_sailorfog_qa(graph, llm, min_hops=3, max_hops=6):
    """Generate a multi‑hop QA pair from a random walk over a knowledge graph."""
    start_node = random.choice(list(graph.nodes))
    path = [start_node]
    current = start_node
    target_hops = random.randint(min_hops, max_hops)
    for _ in range(target_hops):
        neighbors = list(graph.neighbors(current))
        if not neighbors:  # dead end: keep the shorter path
            break
        current = random.choice(neighbors)
        path.append(current)
    # llm.path_to_question is an LLM wrapper that verbalizes the path
    question = llm.path_to_question(path, graph)
    answer = path[-1]  # the terminal node is the verifiable answer
    return {"question": question, "answer": answer, "hops": len(path) - 1}

Method 2: WebFrontier – Iterative Complexity Upgrade

Start from a small seed QA set and repeatedly apply four upgrade operations to increase difficulty:

Entity Replacement: swap answer entities with rarer variants.

Condition Augmentation: add temporal, geographic, or other constraints.

Comparison Merge: combine two independent questions into a comparative query.

Negation Inversion: rephrase a positive question into a negative form.

Each upgrade increments a complexity level, producing a graded curriculum from easy to hard. The method scales well because only a few seed questions are needed and the upgrade chain is fully traceable.
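The upgrade loop can be sketched in a few lines. The operation functions here are illustrative stubs (a real pipeline would delegate the rewriting to an LLM), and the names `upgrade`, `entity_replacement`, etc. are assumptions rather than an official WebFrontier API:

```python
import random

# Illustrative stubs for the four upgrade operations; in practice each
# would call an LLM to rewrite the question.
def entity_replacement(q):
    return f"[rarer-entity variant of] {q}"

def condition_augmentation(q):
    return f"{q} (restricted to a specific year and region)"

def comparison_merge(q, other):
    # Needs a second question, so it is applied separately in a real pipeline.
    return f"Compare: ({q}) versus ({other})"

def negation_inversion(q):
    return f"Which of the following is NOT true: {q}"

def upgrade(seed_question, rounds=3):
    """Apply random upgrade operations, tracking the complexity level
    and keeping the full upgrade chain for traceability."""
    item = {"question": seed_question, "level": 0, "history": [seed_question]}
    ops = [entity_replacement, condition_augmentation, negation_inversion]
    for _ in range(rounds):
        op = random.choice(ops)
        item["question"] = op(item["question"])
        item["level"] += 1
        item["history"].append(item["question"])  # traceable upgrade chain
    return item
```

Because every intermediate version is kept in `history`, any upgraded question can be audited back to its seed.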

Method 3: WebShaper – Formal Reasoning‑Chain Control

Define a formal “knowledge projection” language that explicitly describes each reasoning step, then generate questions that follow the specified chain. Example projection sequence:

Projection P1: from "Company X" to "CEO of X"
Projection P2: from "CEO of X" to "the school they graduated from"
Projection P3: from "the school" to "its current president"

The resulting question is: “Who is the current president of the school that the CEO of Company X graduated from?” This method provides complete control over the number and type of reasoning steps, enabling a uniformly distributed difficulty curriculum.
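The composition can be sketched mechanically; the `Projection` class and `{x}` templates below are illustrative assumptions, not WebShaper's actual formal language:

```python
from dataclasses import dataclass

@dataclass
class Projection:
    template: str  # natural-language form of one reasoning step

def compose_question(start_entity: str, chain: list) -> str:
    """Nest projections outward from the start entity; the chain length
    directly controls the number of reasoning hops."""
    phrase = start_entity
    for p in chain:
        phrase = p.template.format(x=phrase)
    return f"Who is {phrase}?"

chain = [
    Projection("the CEO of {x}"),                      # P1
    Projection("the school that {x} graduated from"),  # P2
    Projection("the current president of {x}"),        # P3
]
```

Calling `compose_question("Company X", chain)` reproduces the three‑hop question from the example, and extending `chain` adds hops one at a time, which is what enables a uniformly distributed difficulty curriculum.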

Method 4: E2HQA – Easy‑to‑Hard Entity Replacement

Begin with simple QA pairs and iteratively replace entities with more obscure references while keeping the answer unchanged. Example:

Original: “Which famous universities are in Beijing?” → Answer: “Peking University, Tsinghua University…”

First replacement: replace “Beijing” with “the city whose abbreviation matches the most concentrated IT hub in China”.

Second replacement: replace with “the city that launched the earliest domestic chip strategy”.

Pros: fast generation and low implementation cost. Cons: limited question variety and mechanically incremental difficulty.
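The replacement loop might look like the following sketch, where a lookup table stands in for an LLM that rewrites an entity as a progressively more indirect description:

```python
# Stub table of obscure references; a real pipeline would generate these
# with an LLM. The entries mirror the Beijing example above.
REFERENCE_LEVELS = {
    "Beijing": [
        "the city whose abbreviation matches the most concentrated IT hub in China",
        "the city that launched the earliest domestic chip strategy",
    ],
}

def harden(question, entity, rounds=2):
    """Replace `entity` with increasingly obscure references.
    The answer never changes, so verification stays trivial."""
    versions = [question]
    for i in range(rounds):
        refs = REFERENCE_LEVELS.get(entity, [])
        if i >= len(refs):
            break  # no harder reference available
        versions.append(versions[0].replace(entity, refs[i]))
    return versions
```

Each version keeps the original answer, which is exactly what makes E2HQA cheap to verify despite the rising difficulty.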

Comparison of four data construction methods

Trajectory Sampling: From Questions to Full Reasoning Paths

Beyond QA pairs, models need complete reasoning trajectories (Thought → Action → Observation → Answer). The process:

Teacher Model Generation: Use a strong LLM (e.g., Claude, GPT‑4o, DeepSeek V3) to generate 4–8 candidate trajectories per question, recording every intermediate step.
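The generation step can be sketched as follows; `teacher.generate` is an assumed client interface, and sampling at a nonzero temperature is what yields diverse candidates per question:

```python
def sample_trajectories(question, teacher, n_candidates=6, temperature=0.8):
    """Collect candidate reasoning trajectories for one question,
    keeping every intermediate Thought/Action/Observation step."""
    candidates = []
    for _ in range(n_candidates):
        # teacher.generate is an assumed interface returning the full trace
        traj = teacher.generate(question, temperature=temperature)
        candidates.append({
            "question": question,
            "steps": traj["steps"],        # (thought, action, observation) tuples
            "final_answer": traj["answer"],
        })
    return candidates
```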

Three‑Stage Funnel Filtering:

Format Validation: Ensure each trajectory follows the required markup (e.g., <think>…</think>, <tool_call>…</tool_call>, <observation>…</observation>, <answer>…</answer>). Non‑conforming samples are discarded.

Correctness Validation: The final answer must match the ground truth; otherwise the trajectory is excluded.

Quality Assessment: Evaluate step count (reasonable range), necessity of each tool call, and clarity of the reasoning chain. Only trajectories passing all three checks are kept.

Typically 30–50 % of raw trajectories survive this pipeline.

def filter_trajectories(trajectories: list[dict]) -> list[dict]:
    """Three‑stage funnel filtering for reasoning trajectories.

    Assumes validate_format, verify_answer, and evaluate_quality are
    defined elsewhere in the pipeline.
    """
    # Stage 1: format validation (required markup tags must be well-formed)
    format_valid = [t for t in trajectories if validate_format(t)]
    print(f"Format pass: {len(format_valid)}/{len(trajectories)}")
    # Stage 2: correctness validation (final answer must match ground truth)
    correct = [t for t in format_valid
               if verify_answer(t["final_answer"], t["ground_truth"])]
    print(f"Answer correctness: {len(correct)}/{len(format_valid)}")
    # Stage 3: quality assessment (step count, no redundant calls, clear reasoning)
    high_quality = []
    for traj in correct:
        score = evaluate_quality(traj)
        if score["step_count_ok"] and score["no_repeat"] and score["reasoning_clear"]:
            high_quality.append(traj)
    print(f"Quality pass: {len(high_quality)}/{len(correct)}")
    return high_quality
Three‑stage trajectory filtering diagram

Data Scale and Distribution

For the cold‑start SFT phase, 1,000–3,000 high‑quality trajectories are sufficient; quality outweighs quantity. A practical split is 1,200 SFT trajectories (≈38 % retention after filtering) and 3,000 RL‑generated trajectories.

Complexity distribution should be balanced: 40 % 3–5 hops, 40 % 6–10 hops, 20 % >10 hops. Task‑type diversity is also crucial—include factual verification, comparative analysis, comprehensive reporting, and time‑constrained queries, each roughly equally represented.
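Enforcing the 40/40/20 hop mix can be done with simple bucketed sampling. The bucket boundaries below mirror the text; the function names are illustrative:

```python
import random

# Target hop-count distribution from the text: 40% 3-5 hops,
# 40% 6-10 hops, 20% more than 10 hops.
TARGET_MIX = {"3-5": 0.4, "6-10": 0.4, ">10": 0.2}

def bucket(hops):
    """Map a hop count to its complexity bucket."""
    if hops <= 5:
        return "3-5"
    if hops <= 10:
        return "6-10"
    return ">10"

def balanced_sample(items, total):
    """Sample `total` items matching TARGET_MIX.
    `items` is a list of dicts with a 'hops' field."""
    by_bucket = {k: [t for t in items if bucket(t["hops"]) == k]
                 for k in TARGET_MIX}
    out = []
    for k, frac in TARGET_MIX.items():
        n = int(total * frac)
        out.extend(random.sample(by_bucket[k], min(n, len(by_bucket[k]))))
    return out
```

The same bucketing idea extends to task type: add a second key (factual verification, comparison, reporting, time‑constrained) and sample each cell evenly.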

Using Negative Samples Wisely

Trajectories filtered out for being incorrect or inefficient can serve as negative examples in reinforcement‑learning (RL) training, helping the model learn to avoid bad reasoning patterns. However, trajectories that fail format validation (e.g., missing a closing <answer> tag) must be discarded entirely, as they teach the model to truncate outputs prematurely.
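One common way to use such negatives is to pair them with accepted trajectories on the same question, e.g. in a DPO‑style preference format. This sketch (with assumed field names) also drops format‑invalid rejects, as the text warns:

```python
def build_preference_pairs(accepted, rejected):
    """Match accepted/rejected trajectories on the same question.
    Only rejects that are well-formed but wrong or inefficient qualify;
    format-invalid trajectories are discarded entirely."""
    usable_rejects = [t for t in rejected if t.get("format_ok")]
    by_question = {}
    for t in usable_rejects:
        by_question.setdefault(t["question"], []).append(t)
    pairs = []
    for good in accepted:
        for bad in by_question.get(good["question"], []):
            pairs.append({
                "prompt": good["question"],
                "chosen": good["trajectory"],
                "rejected": bad["trajectory"],
            })
    return pairs
```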

Tags: AI agents, LLM training, data construction, multi-step reasoning, trajectory sampling
Written by Wu Shixiong's Large Model Academy