Why Public QA Datasets Fail for Deep Research Agents—and How to Build Effective Training Data

The article explains that single‑ or two‑hop QA datasets cannot teach Deep Research agents multi‑step reasoning, outlines four mainstream data‑construction methods, describes trajectory sampling with a three‑stage funnel filter, and shares practical guidelines on data volume, difficulty distribution, question types, and common pitfalls.

Wu Shixiong's Large Model Academy

Jun 24, 2026

1. Task‑Complexity Gap: Why Existing QA Datasets Are Unsuitable

Many public QA datasets, such as Natural Questions (NQ) and TriviaQA, are single‑hop tasks: the answer can be extracted from a single paragraph, so the model learns the behavior "search once, find the answer, stop." HotpotQA requires two hops, but that is still far below the ten‑to‑twenty‑hop reasoning that Deep Research demands. Analyzing a company's competitive landscape, for example, involves multiple searches, comparisons, and the reconciliation of contradictory information.

This structural gap means that training on single‑ or two‑hop data never exposes the model to "continue digging when information is insufficient," so the model cannot learn the desired multi‑step behavior.

In early experiments, using a large batch of HotpotQA and single‑hop QA for supervised fine‑tuning (SFT) produced a model that performed well on simple questions but, when faced with genuine multi‑step tasks, would either answer prematurely or fabricate answers without further searching. The root cause was the lack of training examples that required repeated searches.

2. Four Mainstream Data‑Construction Methods

Method 1: SailorFog‑QA – Random Walks on a Knowledge Graph

The key insight is that a complex question corresponds to a complex path in a knowledge graph. The process has three steps:

Extract entities from Wikipedia or a domain‑specific knowledge base to build a graph where nodes are entities and edges are relations.

Perform a random walk on the graph; the number of hops determines the required reasoning depth.

Use a strong LLM to translate the walked path into a natural‑language question whose answer is the terminal node.

Advantage: answers are highly verifiable because the walked path itself is the evidence. An advanced version, SailorFog V2, introduces "orbit‑node fuzzification," which makes intermediate nodes less explicit and forces the model to resolve ambiguous descriptions.

Drawback: the quality of generated questions depends heavily on the underlying graph; noisy or incorrect edges produce questions that are correct on the graph but wrong in reality.

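import random


def generate_sailorfog_qa(graph, min_hops=3, max_hops=6):
    """Random walk on a knowledge graph to generate multi‑hop QA. The path endpoint is the answer."""
    # `graph` is assumed to expose networkx-style `.nodes` and `.neighbors()`;
    # `llm` is assumed to be a module-level client that offers `path_to_question`.
    current = random.choice(list(graph.nodes))
    path = [current]
    target_hops = random.randint(min_hops, max_hops)
    for _ in range(target_hops):
        # Skip already-visited nodes so the walk cannot loop back on itself.
        neighbors = [n for n in graph.neighbors(current) if n not in path]
        if not neighbors:
            break
        current = random.choice(neighbors)
        path.append(current)
    if len(path) - 1 < min_hops:
        return None  # dead end reached too early; the caller should resample
    question = llm.path_to_question(path, graph)
    return {"question": question, "answer": path[-1], "hops": len(path) - 1}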

[Figure: Task‑complexity gap, from single‑hop QA to Deep Research]

Method 2: WebFrontier – Iterative Upgrade from Seed QA

Starting from a small set of simple seed QA, four deterministic operations are repeatedly applied to raise difficulty:

Entity Replacement: swap answer entities with rarer variants.

Condition Augmentation: add temporal, geographic, or other constraints.

Comparison Merge: combine two independent questions into a comparative one.

Negation Reversal: turn a positive question into a "which of the following is NOT …" format.

Each upgrade step records the transformation chain, enabling a clear difficulty label and traceability. The method scales well because a few seeds can generate a large, graded dataset, and the provenance helps diagnose model weaknesses later.

The full upgrade pipeline and prompt templates are available in the author's public repository.
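
The chain-recording idea can be sketched as follows. This is a minimal illustration, not the author's pipeline: the `QAItem` record, the `apply_upgrade` helper, and the toy `add_year_condition` operation are all hypothetical names, and a real upgrade step would call an LLM rather than edit the string directly. The difficulty label is assumed here to be simply the number of upgrades applied.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QAItem:
    question: str
    answer: str
    # Names of the upgrade operations applied so far, in order.
    chain: list[str] = field(default_factory=list)

    @property
    def difficulty(self) -> int:
        # Assumption: the difficulty label is the number of upgrades applied.
        return len(self.chain)


def apply_upgrade(item: QAItem, name: str,
                  op: Callable[[str, str], tuple[str, str]]) -> QAItem:
    """Apply one upgrade and return a new item with the step recorded."""
    question, answer = op(item.question, item.answer)
    return QAItem(question, answer, item.chain + [name])


def add_year_condition(question: str, answer: str) -> tuple[str, str]:
    # Toy stand-in for an LLM-backed "condition augmentation" step.
    return question.rstrip("?") + " as of 2020?", answer


seed = QAItem("Who is the CEO of Company X?", "Person Y")
upgraded = apply_upgrade(seed, "condition_augmentation", add_year_condition)
```

Because each upgrade returns a new item instead of mutating the seed, every intermediate difficulty level stays available for the graded dataset.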

[Figure: WebFrontier, iterative upgrade from seed QA]

Method 3: WebShaper – Formal Reasoning‑Chain Projection

Each question is described by a formal sequence of knowledge‑projection operations, e.g.

Projection P1: from "Company X" project to "CEO of X"
Projection P2: from "CEO of X" project to "his alma mater"
Projection P3: from "alma mater" project to "current president"

The LLM then converts this chain into a natural‑language question such as "Who is the current president of the school from which the CEO of Company X graduated?" This method gives precise control over the number of reasoning steps and the type of each step, which supports curriculum learning. The downside is a higher design cost than WebFrontier.
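
One way to picture the projection chain is as a list of typed steps that are composed into a single nested description. The sketch below is an assumption-laden illustration: the `Projection` class and `render_chain` function are hypothetical, and fixed string templates stand in for the LLM that would phrase the question in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    relation: str  # type of the reasoning step, e.g. "ceo"
    template: str  # phrasing of the step; "{}" is the previous description


def render_chain(start: str, chain: list[Projection]) -> str:
    """Compose projections into one nested description of the target entity."""
    description = start
    for step in chain:
        description = step.template.format(description)
    return description


chain = [
    Projection("ceo", "the CEO of {}"),
    Projection("alma_mater", "the alma mater of {}"),
    Projection("president", "the current president of {}"),
]
description = render_chain("Company X", chain)
question = f"Who is {description}?"
hops = len(chain)  # the number of reasoning steps is explicit by construction
```

Since the step count and each step's relation type are explicit fields, a curriculum can be assembled by filtering on `hops` or on the relations used.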

Method 4: E2HQA (Easy‑to‑Hard QA) – Incremental Entity Replacement

Starting from a simple QA, successive rounds of entity replacement increase difficulty while keeping the answer unchanged. Example: "Which famous universities are in Beijing?" → replace "Beijing" with "the city with the most concentrated IT industry in China" → further replace with "the city that first proposed an independent chip strategy in China". The answer remains the same, but the reasoning depth grows.

Advantages: fast generation, low cost, and inherent answer correctness. Drawbacks: limited question variety and risk of the model learning shortcuts by reverse‑engineering the replaced entity.
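
The replacement rounds can be sketched as plain string substitution. This is only an illustration under stated assumptions: the `harden` function is hypothetical, and in practice an LLM or a knowledge base would propose each indirect description and confirm that it still identifies the same entity.

```python
def harden(question: str, replacements: list[tuple[str, str]]) -> list[str]:
    """Return the question after each replacement round; the answer is untouched."""
    versions = [question]
    for entity, description in replacements:
        if entity not in versions[-1]:
            raise ValueError(f"{entity!r} not found in the current question")
        # Replace only the first occurrence to keep the edit targeted.
        versions.append(versions[-1].replace(entity, description, 1))
    return versions


first = "the city with the most concentrated IT industry in China"
second = "the city that first proposed an independent chip strategy in China"
versions = harden(
    "Which famous universities are in Beijing?",
    [("Beijing", first), (first, second)],
)
```

Each entry in `versions` is one difficulty level of the same question, so all of them share a single ground-truth answer.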

3. Trajectory Sampling and Three‑Stage Funnel Filtering

Training data must include not only question‑answer pairs but also the full reasoning trajectory: the sequence of thoughts, tool calls, observations, and the final answer. A strong Teacher model (e.g., Claude, GPT‑4o, DeepSeek V3) generates multiple candidate trajectories per question (typically 4‑8).
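
A trajectory record and the sampling loop might look like the sketch below. The `Step` and `Trajectory` classes, the `sample_trajectories` function, and the `fake_teacher` stub are all hypothetical; a real teacher would be an API call to one of the models named above, and the exact fields depend on the agent framework in use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    thought: str
    tool_call: str
    observation: str


@dataclass
class Trajectory:
    question: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""


def sample_trajectories(question: str,
                        teacher: Callable[[str, int], Trajectory],
                        k: int = 4) -> list[Trajectory]:
    """Ask the teacher for k independent candidate trajectories."""
    return [teacher(question, seed) for seed in range(k)]


def fake_teacher(question: str, seed: int) -> Trajectory:
    # Stand-in for a real teacher-model call; returns a one-step trajectory.
    step = Step("I should search first.", f"search({question!r})", f"result #{seed}")
    return Trajectory(question, [step], final_answer=f"answer-{seed}")


candidates = sample_trajectories("Who founded Company X?", fake_teacher, k=4)
```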

Filtering proceeds in three increasingly strict layers:

Format Validation: ensure the trajectory contains correctly nested tags for thought, tool call, observation, and answer.

Correctness Validation: discard trajectories whose final answer does not match the ground‑truth.

Quality Evaluation: keep only trajectories with reasonable step counts, necessary tool calls, and clear reasoning.

In practice, about 38 % of raw trajectories survive all three stages.

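def filter_trajectories(trajectories: list[dict]) -> list[dict]:
    """Three‑stage funnel filtering: format → correctness → quality."""
    # `validate_format`, `verify_answer`, and `evaluate_quality` are assumed to be defined elsewhere.
    # Stage 1: keep only trajectories whose tags are well formed.
    stage1 = [t for t in trajectories if validate_format(t)]
    # Stage 2: keep only trajectories whose final answer matches the ground truth.
    stage2 = [t for t in stage1 if verify_answer(t["final_answer"], t["ground_truth"])]
    # Stage 3: keep only trajectories that pass every quality check.
    stage3 = []
    for t in stage2:
        s = evaluate_quality(t)
        if s["step_count_ok"] and s["no_repeat"] and s["reasoning_clear"]:
            stage3.append(t)
    print(f"Funnel: {len(trajectories)} → {len(stage1)} → {len(stage2)} → {len(stage3)}")
    return stage3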
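
The first stage of the funnel, format validation, could be implemented as a stack-based tag check. The sketch below assumes trajectories are serialized as text with `<think>`, `<tool_call>`, `<observation>`, and `<answer>` tags; the source does not specify the markup, so these tag names and the `has_valid_tags` function are illustrative assumptions.

```python
import re

# Assumed tag names; the actual markup depends on the agent framework.
TAGS = {"think", "tool_call", "observation", "answer"}
TAG_RE = re.compile(r"<(/?)([a-z_]+)>")


def has_valid_tags(text: str) -> bool:
    """True if known tags are properly nested and a single <answer> block ends the text."""
    stack: list[str] = []
    answer_count = 0
    for match in TAG_RE.finditer(text):
        is_close, name = match.group(1) == "/", match.group(2)
        if name not in TAGS:
            continue  # ignore unrelated markup, e.g. HTML inside an observation
        if not is_close:
            stack.append(name)
            if name == "answer":
                answer_count += 1
        elif not stack or stack.pop() != name:
            return False  # closing tag without a matching opener
    return not stack and answer_count == 1 and text.rstrip().endswith("</answer>")
```

A truncated trajectory leaves an unclosed tag on the stack and is rejected, which is the case that matters most for keeping cut-off samples out of training.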

[Figure: Three‑stage funnel filtering of trajectories]

4. Data Composition: Quantity, Difficulty Distribution, and Question Types

For a cold‑start SFT phase, 1 000–3 000 high‑quality trajectories are sufficient; the goal is to teach the model the correct format and basic multi‑step reasoning posture. Quality outweighs quantity: 100 carefully curated samples can outperform 1 000 noisy ones. The author's project used 1 200 trajectories for SFT and later added 3 000 online‑sampled trajectories for RL.

Difficulty should be balanced: 40 % of questions with 3‑5 hops, 40 % with 6‑10 hops, and 20 % with >10 hops. This curriculum‑style progression stabilises convergence.
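
A 4:4:2 mix can be drawn from a larger question pool with stratified sampling. The sketch below assumes each question carries a `hops` field; the `bucket` and `sample_mix` helpers and the bucket names are hypothetical, and the result is returned in easy-to-hard order.

```python
import random


def bucket(hops: int) -> str:
    """Map a hop count to the difficulty bands described above."""
    if hops <= 5:
        return "easy"    # 3-5 hops
    if hops <= 10:
        return "medium"  # 6-10 hops
    return "hard"        # more than 10 hops


def sample_mix(pool: list[dict], total: int, seed: int = 0) -> list[dict]:
    """Draw `total` questions with a 40/40/20 easy/medium/hard split."""
    ratios = {"easy": 0.4, "medium": 0.4, "hard": 0.2}
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_bucket: dict[str, list[dict]] = {name: [] for name in ratios}
    for item in pool:
        by_bucket[bucket(item["hops"])].append(item)
    selected = []
    for name, share in ratios.items():
        want = round(total * share)
        if len(by_bucket[name]) < want:
            raise ValueError(f"not enough {name} questions: need {want}")
        selected.extend(rng.sample(by_bucket[name], want))
    return selected


pool = [{"hops": h} for h in range(3, 15) for _ in range(10)]
selected = sample_mix(pool, total=50)
counts = {name: sum(bucket(q["hops"]) == name for q in selected)
          for name in ("easy", "medium", "hard")}
```

Raising an error when a band is short is a deliberate choice in this sketch: silently topping up from another band would skew the distribution without anyone noticing.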

Question types must be diverse: factual verification, comparative analysis, comprehensive report generation, and time‑constrained queries. Even distribution across these types improves generalisation.

[Figure: Data recipe, covering volume, difficulty gradient, and question‑type diversity]

5. Verifying the Utility of Constructed Data

Path length alone is a coarse difficulty proxy. Instead, calibrate difficulty by sampling the baseline model multiple times (e.g., 16) and measuring success rate. High success → easy; very low → too hard or broken; middle range → most valuable for training.

def calibrate_difficulty(question, model, n_samples=16):
    """Use multi‑sample success rate to re‑label difficulty, far more reliable than hop count."""
    successes = sum(verify_answer(model.solve(question), question["answer"]) for _ in range(n_samples))
    rate = successes / n_samples
    if rate >= 0.8:
        return "too_easy"
    if rate <= 0.05:
        return "too_hard_or_broken"
    return "good"

This also surfaces broken questions whose answers are wrong or whose knowledge‑graph paths are unrealistic.

Another validation step is a small‑scale probe training: fine‑tune on a tiny subset and monitor a fixed validation set. If a new batch of data degrades metrics, discard it before full‑scale training.
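
The accept-or-discard decision after a probe run can be reduced to a small gate. This is a sketch under assumptions: the `keep_batch` function and its `tolerance` parameter are hypothetical, the scores are taken to be accuracies on the fixed validation set from several runs, and the default tolerance of 0.01 is an arbitrary placeholder to be tuned against run-to-run noise.

```python
def keep_batch(baseline_scores: list[float], probe_scores: list[float],
               tolerance: float = 0.01) -> bool:
    """Accept a new batch unless the probe mean falls below baseline by more than `tolerance`."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    probe = sum(probe_scores) / len(probe_scores)
    return probe >= baseline - tolerance
```

Averaging over several runs matters because a single small fine-tune can move a validation metric by chance alone.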

6. Two Common Pitfalls

Pitfall 1: Treating all discarded trajectories as negative samples. Trajectories filtered out for quality can be reused as negative examples during RL, but only if they are well formed. Format‑broken trajectories must never be used, as they teach the model to truncate prematurely.
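
In code, the rule amounts to splitting the discard pile before any of it is reused. The `split_discarded` helper below is a hypothetical sketch, and the lambda is a toy check; a real pipeline would pass in the same format validator used in the first funnel stage.

```python
from typing import Callable


def split_discarded(discarded: list[dict],
                    is_well_formed: Callable[[dict], bool]) -> tuple[list[dict], list[dict]]:
    """Separate reusable negatives from format-broken trajectories that must be dropped."""
    negatives = [t for t in discarded if is_well_formed(t)]
    dropped = [t for t in discarded if not is_well_formed(t)]
    return negatives, dropped


discarded = [
    {"id": 1, "text": "<answer>wrong but complete</answer>"},
    {"id": 2, "text": "<answer>cut off mid"},
]
# Toy check for illustration only.
negatives, dropped = split_discarded(discarded, lambda t: t["text"].endswith("</answer>"))
```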

Pitfall 2: Data contamination. If generated training questions overlap semantically with benchmark evaluation sets, the model may appear to perform well by memorising answers. Prevent this by semantic deduplication: compute embedding similarity between each training question and every evaluation question, and drop those exceeding a threshold.
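
The deduplication step can be sketched as a cosine-similarity filter. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, `drop_contaminated` is a hypothetical helper, and the 0.8 threshold is an assumed value that would need tuning against known duplicate pairs.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Bag-of-words stand-in for a real sentence-embedding model (assumption)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def drop_contaminated(train: list[str], eval_set: list[str],
                      embed: Callable[[str], Vector] = toy_embed,
                      threshold: float = 0.8) -> list[str]:
    """Keep training questions whose similarity to every eval question is below threshold."""
    eval_vecs = [embed(q) for q in eval_set]  # embed the eval set once
    kept = []
    for question in train:
        vec = embed(question)
        if all(cosine(vec, e) < threshold for e in eval_vecs):
            kept.append(question)
    return kept


eval_set = ["Who is the CEO of Company X?"]
train = ["Who is the current CEO of Company X?", "Which river flows through Paris?"]
kept = drop_contaminated(train, eval_set)
```

The pairwise loop is quadratic in the two set sizes; for large corpora an approximate nearest-neighbor index would be the usual substitute.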

[Figure: Semantic deduplication to avoid evaluation contamination]

7. How to Answer This Question in an Interview

When asked "Where do your Deep Research training data come from?", follow four steps:

Highlight the task‑complexity gap (≈20 s).

Describe the main construction method you used, e.g., WebFrontier for scalable gradient data and SailorFog‑QA for high‑verifiability hard questions (≈30 s).

Explain the quality‑control pipeline: teacher‑generated trajectories + three‑stage funnel, with a ~38 % retention rate (≈30 s).

State the data composition (≈20 s): 1 200 SFT trajectories, 4:4:2 difficulty split, and the two pitfalls about negative samples and contamination.

Be prepared for follow‑up questions on difficulty calibration, required data volume, and evaluation‑set de‑duplication.

Conclusion

This article completes the first pillar of Deep Research training: data. It shows that data quality, not model architecture, determines the ceiling of agent capabilities. Four construction methods (SailorFog‑QA, WebFrontier, WebShaper, E2HQA) address multi‑step question generation; trajectory sampling with a three‑stage funnel guarantees clean samples; and careful composition and de‑duplication secure a solid training foundation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
