Why Cleaning SFT Data Is a Nightmare: Hidden JSON Formatting Pitfalls
Cleaning SFT data for LLMs is surprisingly complex: subtle JSON formatting variations, inconsistent markdown wrappers, indent settings, and unit handling can make model outputs inconsistent, so reliable training data requires a unified standard, careful prompt design, and extensive manual review.
Cleaning supervised fine‑tuning (SFT) data for large language models is far more tedious than it appears; data quickly becomes outdated, so even last year's high‑quality datasets must be revisited and re‑annotated each year.
For example, a user asks whether to choose a cat or a dog as a pet. In 2023 GPT‑4 replied with a safe, generic answer, while in 2024 the same model added more informal suggestions, illustrating how model updates can render previous responses stale and necessitate re‑labeling.
The core difficulty lies in standardizing JSON output formats. Variations include:
Markdown‑wrapped JSON (surrounded by ```json and ```).
Different indent values (e.g., 0, 2, 4) that control indentation and line breaks.
JSONL format, where each JSON object occupies a single line.
Requests that forbid markdown rendering.
Prompts that demand "analyze first, then output JSON" or "output only JSON and nothing else".
To manage these, two practical steps are recommended:
Choose a single canonical JSON style (e.g., with or without markdown, a fixed indent value, single‑line or multi‑line) and convert every SFT example to that style.
Augment prompts with explicit instructions so the model consistently follows the chosen format.
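The two steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive pipeline: it assumes each response holds exactly one JSON value (JSONL and "analyze first, then output JSON" responses would need extra handling), and the names `to_canonical_json`, `augment_prompt`, and `FORMAT_INSTRUCTION` are hypothetical, as is the choice of a 2-space indent without markdown as the canonical style.

```python
import json
import re

# Matches an optional markdown wrapper such as ```json ... ``` around the payload.
FENCE_RE = re.compile(
    r"^\s*```(?:json|jsonl)?\s*\n(.*?)\n?\s*```\s*$",
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical instruction text; adapt the wording to the canonical style you pick.
FORMAT_INSTRUCTION = (
    "Output only a single JSON object with 2-space indentation. "
    "Do not wrap it in markdown and do not add any other text."
)


def strip_markdown_fence(text: str) -> str:
    """Return the payload without a surrounding markdown code fence, if any."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def to_canonical_json(text: str, indent=2) -> str:
    """Parse one JSON value and re-serialize it in a single fixed style.

    Raises json.JSONDecodeError for malformed JSON, so bad examples surface
    for manual review instead of being silently kept.
    """
    obj = json.loads(strip_markdown_fence(text))
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
    return json.dumps(obj, ensure_ascii=False, indent=indent)


def augment_prompt(prompt: str) -> str:
    """Append the explicit format instruction to a prompt."""
    return f"{prompt.rstrip()}\n\n{FORMAT_INSTRUCTION}"
```

Passing `indent=None` yields single-line JSON instead. Whichever value is chosen, the point is that every example passes through the same function.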
When a few‑shot example contains malformed JSON, the model may either mimic the error or enforce proper syntax. Testing five leading providers (GPT‑4o, Doubao, Kimi, Qwen, and Wenxin) shows divergent behaviors:
GPT‑4o: markdown + standard JSON + indent 2
Doubao: standard JSON + indent 0
Kimi: markdown + few‑shot JSON + JSONL
Qwen: few‑shot JSON + JSONL
Wenxin: markdown + standard JSON + indent 4
The takeaway is that any JSON style works for training as long as it is applied uniformly; inconsistency harms both model learning and downstream batch requests.
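One way to check uniformity is to assign each response a coarse style label and count the labels across the dataset. The sketch below is an assumption-laden illustration, not the procedure used for the comparison above: the label names (`markdown+`, `indent2`, `jsonl`, `invalid`) and the functions `json_style` and `style_report` are hypothetical, and the indent width is simply read from the second line of a pretty-printed value.

```python
import json
from collections import Counter


def json_style(text: str) -> str:
    """Return a coarse label describing the JSON style of one response."""
    stripped = text.strip()
    wrapped = stripped.startswith("```")
    if wrapped:
        # Drop the opening fence line and the closing fence, if present.
        lines = stripped.splitlines()[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        stripped = "\n".join(lines).strip()
    prefix = "markdown+" if wrapped else ""

    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        # Not a single JSON value: check whether every non-empty line is JSON.
        rows = [line for line in stripped.splitlines() if line.strip()]
        try:
            for row in rows:
                json.loads(row)
        except json.JSONDecodeError:
            return prefix + "invalid"
        return prefix + ("jsonl" if rows else "invalid")

    if "\n" not in stripped:
        return prefix + "indent0"  # single-line JSON
    # Leading spaces on the second line approximate the indent width.
    second = stripped.splitlines()[1]
    return prefix + f"indent{len(second) - len(second.lstrip(' '))}"


def style_report(responses) -> Counter:
    """Count style labels; more than one key means the dataset is mixed."""
    return Counter(json_style(r) for r in responses)
```

A report with more than one key, or with any `invalid` entries, flags examples that still need conversion or manual review.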
Another subtle issue is numeric representation. Using raw float or int values can omit essential units (e.g., "million $"), reducing accuracy for financial tasks. To preserve units, add an explicit unit field in the JSON output, as shown below:
{
  "2018": {
    "net_profit": 5678,
    "unit": "million$"
  }
}

Additional challenges include escaping dollar signs in LaTeX expressions, handling Chinese punctuation, and deciding whether to insert commas in large numbers. These details often require bulk text processing scripts.
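A bulk text processing script for these details might look like the sketch below. Everything here is illustrative and assumption-based: the helper names (`split_amount`, `normalize_punctuation`, `escape_dollars`) and the punctuation map are hypothetical, the amount pattern is deliberately loose, and whether dollar signs should be escaped at all depends on how the target renderer treats LaTeX.

```python
import re

# Hypothetical, partial map from full-width Chinese punctuation to ASCII.
PUNCT_MAP = str.maketrans({
    "，": ",", "：": ":", "；": ";",
    "（": "(", "）": ")", "“": '"', "”": '"',
})

# A number with optional thousands separators and decimals, then a free-form unit.
AMOUNT_RE = re.compile(r"^\s*(-?\d[\d,]*(?:\.\d+)?)\s*(.*?)\s*$")


def normalize_punctuation(text: str) -> str:
    """Replace full-width punctuation with ASCII equivalents."""
    return text.translate(PUNCT_MAP)


def split_amount(text: str):
    """Split a string such as '5,678 million$' into (value, unit).

    Commas are dropped so the value can be stored as a plain number,
    with the unit kept in a separate field.
    """
    match = AMOUNT_RE.match(text)
    if not match:
        raise ValueError(f"not an amount: {text!r}")
    number = match.group(1).replace(",", "")
    value = float(number) if "." in number else int(number)
    return value, match.group(2)


def escape_dollars(text: str) -> str:
    """Escape '$' characters that are not already preceded by a backslash."""
    return re.sub(r"(?<!\\)\$", r"\\$", text)
```

For example, `split_amount("5,678 million$")` returns the value `5678` and the unit `"million$"`, which can then be written into the `net_profit` and `unit` fields shown above.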
In summary, the most critical lesson is not the specific JSON format but the strict consistency of that format across the entire SFT dataset.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
