How to Build Agentic Factual SFT and Mid‑Train Datasets: Query Selection, Trajectory Generation, and Tool Usage

This article outlines a systematic approach for creating agentic factual SFT and Mid‑train data, covering the definition of training goals, query filtering, two‑layer classification and labeling, trajectory format, differences between Mid‑train and SFT, a practical synthesis pipeline, and common pitfalls to avoid.

Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
How to Build Agentic Factual SFT and Mid‑Train Datasets: Query Selection, Trajectory Generation, and Tool Usage

1. Define the training objective

Agentic factual ability is not about returning a static answer; it requires a verifiable, reproducible reasoning process: detect the question, decide whether to search, identify the needed evidence, assess evidence sufficiency, resolve conflicts, and finally answer only what the evidence supports.

2. Contrast ordinary factual QA with agentic factual data

Typical factual QA follows a simple question → answer pattern, e.g., "Who is the current CEO of Alibaba Group? → Wu Yongming." Agentic factual data, however, trains the model to perform a chain of actions:

question → search → observe evidence → judge → final answer

Example:

Question: "Who is the current CEO of Alibaba Group?"

Action: search("Alibaba Group current CEO official") Observation: The official Alibaba management page shows Wu Yongming as CEO.

Final: "As of the query time, Alibaba’s official site lists Wu Yongming as CEO."

The model learns that the answer is time‑sensitive, should prioritize official sources, include a time boundary, and never rely on memory alone.

3. Query filtering and value

Not all queries are useful for training. Simple factual questions like "What is the capital of China?" do not test agentic capabilities. Valuable queries require evidence selection, temporal relevance, or handling of ambiguous scopes, e.g.:

"Who is the current CEO of Alibaba Group?"

"Did Company X profit in 2023?"

"What is the Q4 revenue according to this announcement?"

"Has drug Y been approved for indication Z?"

"Was a large model released in 2024?"

These queries force the model to decide what to search, which sources are authoritative, and how to handle conflicting or insufficient evidence.

4. Two‑layer query classification and tagging

First layer – Question Type – determines the overall factual task, such as:

Timely factual (current CEO, latest version)

Document‑based QA (based on a given announcement)

Incorrect premise detection

Scope‑ambiguity judgment (profit vs EBITDA)

Conflict‑evidence handling

Insufficient‑evidence refusal

Second layer – Processing Tags – guides the downstream generation pipeline, e.g.:

Task category (which of the above)

Evidence scope (closed document vs open web)

Need to search? (true/false)

Require authoritative source? (true/false)

Recommended evidence sources (annual report, official announcement, regulatory database)

Trajectory generation strategy (e.g., first check net profit, then adjusted EBITDA)

Example JSON labeling for "Did Company X profit in 2023?":

{
  "任务类别": "口径歧义判断类",
  "证据范围": "开放检索",
  "是否需要检索": true,
  "是否需要权威来源": true,
  "推荐证据源": ["年报", "公司公告", "交易所公告"],
  "轨迹生成策略": "分别查净利润和调整后 EBITDA,最后分口径回答"
}

5. Trajectory data format

A complete trajectory contains four core fields:

query:用户问题
类别:属于哪类事实任务
证据:从哪里查到什么
response:最终要训练的求证过程和答案

Concrete example for the profit query:

{
  "query": "某公司 2023 年是否盈利?",
  "类别": "口径歧义判断类",
  "证据": [
    "E1: 年报显示归属于股东的净亏损为 12 亿元",
    "E2: 公告称调整后 EBITDA 盈利"
  ],
  "response": "先查年报净利润,再查调整后 EBITDA,最后回答:按净利润口径亏损,按调整后 EBITDA 口径盈利,不能简单说已经盈利。"
}

6. Differences between Mid‑train and SFT data

Mid‑train data focuses on ability training: it is highly structured, emphasizes claim decomposition, evidence matching, stance judgment, conflict resolution, and reasoning steps. It does not need to mimic a real conversation.

SFT data emphasizes behavior alignment: it should resemble the assistant’s final user‑facing output, showing when to search, how to cite evidence, and how to give a restrained answer with clear boundaries.

7. Practical synthesis pipeline

Clean and deduplicate raw queries.

Apply the two‑layer classification and generate processing tags.

Construct evidence_pack (a curated set of reliable evidence snippets).

Generate trajectory samples using the evidence pack.

Score trajectories with a verifier model.

Write qualified samples to SFT or Mid‑train datasets.

The evidence pack is crucial; observations in the trajectory must come directly from it, not from the model’s memory.

8. Common pitfalls

Fabricating observations (e.g., claiming "the official site shows…" without real evidence).

Focusing only on the final answer and ignoring the reasoning process.

Always using the internet regardless of the query’s constraints (e.g., when the user explicitly asks to rely on a given announcement).

Answering confidently when evidence is insufficient; the model should downgrade or refuse.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SFTAgentic AIdata synthesistrajectory generationevidence labelingmid-trainquery filtering
Machine Learning Algorithms & Natural Language Processing
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.