How to Build Agentic Factual SFT and Mid‑Train Datasets: Query Selection, Trajectory Generation, and Tool Usage
This article outlines a systematic approach for creating agentic factual SFT and Mid‑train data, covering the definition of training goals, query filtering, two‑layer classification and labeling, trajectory format, differences between Mid‑train and SFT, a practical synthesis pipeline, and common pitfalls to avoid.
1. Define the training objective
Agentic factual ability is not about returning a static answer; it requires a verifiable, reproducible reasoning process: detect the question, decide whether to search, identify the needed evidence, assess evidence sufficiency, resolve conflicts, and finally answer only what the evidence supports.
2. Contrast ordinary factual QA with agentic factual data
Typical factual QA follows a simple question → answer pattern, e.g., "Who is the current CEO of Alibaba Group? → Wu Yongming." Agentic factual data, however, trains the model to perform a chain of actions:
question → search → observe evidence → judge → final answer
Example:
Question: "Who is the current CEO of Alibaba Group?"
Action: search("Alibaba Group current CEO official") Observation: The official Alibaba management page shows Wu Yongming as CEO.
Final: "As of the query time, Alibaba’s official site lists Wu Yongming as CEO."
The model learns that the answer is time‑sensitive, should prioritize official sources, include a time boundary, and never rely on memory alone.
3. Query filtering and value
Not all queries are useful for training. Simple factual questions like "What is the capital of China?" do not test agentic capabilities. Valuable queries require evidence selection, temporal relevance, or handling of ambiguous scopes, e.g.:
"Who is the current CEO of Alibaba Group?"
"Did Company X profit in 2023?"
"What is the Q4 revenue according to this announcement?"
"Has drug Y been approved for indication Z?"
"Was a large model released in 2024?"
These queries force the model to decide what to search, which sources are authoritative, and how to handle conflicting or insufficient evidence.
4. Two‑layer query classification and tagging
First layer – Question Type – determines the overall factual task, such as:
Timely factual (current CEO, latest version)
Document‑based QA (based on a given announcement)
Incorrect premise detection
Scope‑ambiguity judgment (profit vs EBITDA)
Conflict‑evidence handling
Insufficient‑evidence refusal
Second layer – Processing Tags – guides the downstream generation pipeline, e.g.:
Task category (which of the above)
Evidence scope (closed document vs open web)
Need to search? (true/false)
Require authoritative source? (true/false)
Recommended evidence sources (annual report, official announcement, regulatory database)
Trajectory generation strategy (e.g., first check net profit, then adjusted EBITDA)
Example JSON labeling for "Did Company X profit in 2023?":
{
"任务类别": "口径歧义判断类",
"证据范围": "开放检索",
"是否需要检索": true,
"是否需要权威来源": true,
"推荐证据源": ["年报", "公司公告", "交易所公告"],
"轨迹生成策略": "分别查净利润和调整后 EBITDA,最后分口径回答"
}5. Trajectory data format
A complete trajectory contains four core fields:
query:用户问题
类别:属于哪类事实任务
证据:从哪里查到什么
response:最终要训练的求证过程和答案Concrete example for the profit query:
{
"query": "某公司 2023 年是否盈利?",
"类别": "口径歧义判断类",
"证据": [
"E1: 年报显示归属于股东的净亏损为 12 亿元",
"E2: 公告称调整后 EBITDA 盈利"
],
"response": "先查年报净利润,再查调整后 EBITDA,最后回答:按净利润口径亏损,按调整后 EBITDA 口径盈利,不能简单说已经盈利。"
}6. Differences between Mid‑train and SFT data
Mid‑train data focuses on ability training: it is highly structured, emphasizes claim decomposition, evidence matching, stance judgment, conflict resolution, and reasoning steps. It does not need to mimic a real conversation.
SFT data emphasizes behavior alignment: it should resemble the assistant’s final user‑facing output, showing when to search, how to cite evidence, and how to give a restrained answer with clear boundaries.
7. Practical synthesis pipeline
Clean and deduplicate raw queries.
Apply the two‑layer classification and generate processing tags.
Construct evidence_pack (a curated set of reliable evidence snippets).
Generate trajectory samples using the evidence pack.
Score trajectories with a verifier model.
Write qualified samples to SFT or Mid‑train datasets.
The evidence pack is crucial; observations in the trajectory must come directly from it, not from the model’s memory.
8. Common pitfalls
Fabricating observations (e.g., claiming "the official site shows…" without real evidence).
Focusing only on the final answer and ignoring the reasoning process.
Always using the internet regardless of the query’s constraints (e.g., when the user explicitly asks to rely on a given announcement).
Answering confidently when evidence is insufficient; the model should downgrade or refuse.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
