Why Cleaning SFT Data Is a Nightmare: Hidden JSON Formatting Pitfalls

Cleaning SFT data for LLMs is surprisingly complex: subtle JSON formatting variations, inconsistent markdown wrappers, indent settings, and unit handling can all make a model's output inconsistent. Reliable training data requires a unified standard, careful prompt design, and extensive manual review.

Baobao Algorithm Notes

Nov 13, 2024

Cleaning supervised fine‑tuning (SFT) data for large language models is far more tedious than it appears. The data also goes stale quickly, so even last year's high‑quality datasets have to be revisited and re‑annotated.

For example, a user asks whether to choose a cat or a dog as a pet. In 2023 GPT‑4 replied with a safe, generic answer, while in 2024 the same model added more informal suggestions, illustrating how model updates can render previous responses stale and necessitate re‑labeling.

The core difficulty lies in standardizing JSON output formats. Variations include:

Markdown‑wrapped JSON (surrounded by ```json and ```).

Different indent values (e.g., 0, 2, 4) that control indentation and line breaks.

JSONL format, where each JSON object occupies a single line.

Requests that forbid markdown rendering.

Prompts that demand "analyze first, then output JSON" or "output only JSON and nothing else".
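To make these variations concrete, the sketch below serializes one object in several of the styles listed above using Python's standard `json` module. The `record` contents and the `render_variants` helper are hypothetical, and mapping the article's "indent 0" to `json.dumps(..., indent=0)` (newlines without leading spaces) is an assumption.

```python
import json

FENCE = "`" * 3  # markdown code fence, built here to keep this listing readable

# Hypothetical record used only for illustration.
record = {"pet": "cat", "reasons": ["quiet", "independent"]}

def render_variants(obj):
    """Serialize the same object in several common output styles."""
    pretty = json.dumps(obj, ensure_ascii=False, indent=2)
    return {
        # indent=0 inserts newlines but no leading spaces.
        "indent_0": json.dumps(obj, ensure_ascii=False, indent=0),
        "indent_2": pretty,
        "indent_4": json.dumps(obj, ensure_ascii=False, indent=4),
        # JSONL: one object per line, so no newlines inside the object.
        "jsonl": json.dumps(obj, ensure_ascii=False, separators=(",", ":")),
        # Markdown-wrapped JSON.
        "markdown": f"{FENCE}json\n{pretty}\n{FENCE}",
    }

variants = render_variants(record)
for name, text in variants.items():
    print(f"--- {name} ---")
    print(text)
```

Every variant decodes to the same object, yet the strings differ, and the string is what the model is trained to reproduce.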

To manage these, two practical steps are recommended:

Choose a single canonical JSON style (e.g., with or without markdown, a fixed indent value, single‑line or multi‑line) and convert every SFT example to that style.

Augment prompts with explicit instructions so the model consistently follows the chosen format.
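The two steps above could be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `canonicalize`, `augment_prompt`, and the wording of `FORMAT_INSTRUCTION` are all hypothetical, and the chosen style (2-space indent, no markdown) is just one possible convention. The sketch assumes each response contains a single JSON value and nothing else; responses that mix analysis text with JSON, or multi-object JSONL, would need extra handling.

```python
import json
import re

# Optional markdown fence around the payload; the language tag is optional.
_FENCE = re.compile(r"^\s*`{3}(?:json|jsonl)?\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)

def canonicalize(response: str, indent: int = 2, markdown: bool = False) -> str:
    """Re-serialize a JSON-only response in one fixed style.

    Invalid JSON raises json.JSONDecodeError (a ValueError subclass),
    so malformed examples can be routed to manual review.
    """
    match = _FENCE.match(response)
    payload = match.group(1) if match else response.strip()
    obj = json.loads(payload)
    text = json.dumps(obj, ensure_ascii=False, indent=indent)
    if markdown:
        fence = "`" * 3
        text = f"{fence}json\n{text}\n{fence}"
    return text

# Hypothetical instruction text matching the style chosen above.
FORMAT_INSTRUCTION = (
    "Output only a JSON object, indented with 2 spaces, "
    "with no markdown fence and no extra text."
)

def augment_prompt(prompt: str) -> str:
    """Append the format instruction unless it is already present."""
    if FORMAT_INSTRUCTION in prompt:
        return prompt
    return prompt.rstrip() + "\n\n" + FORMAT_INSTRUCTION
```

Keeping the instruction text and the serializer settings side by side makes it harder for the prompt and the target output to drift apart.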

When a few‑shot example contains malformed JSON, the model may either mimic the error or enforce proper syntax. A test of five leading models (GPT‑4o, Doubao, Kimi, Qwen, and Wenxin) shows divergent behavior:

GPT‑4o: markdown + standard JSON + indent 2

Doubao: standard JSON + indent 0

Kimi: markdown + few‑shot JSON + JSONL

Qwen: few‑shot JSON + JSONL

Wenxin: markdown + standard JSON + indent 4

The takeaway is that any JSON style works for training as long as it is applied uniformly; inconsistency harms both model learning and downstream batch requests.
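One way to check uniformity before training is to label each response with a coarse style tag and count the tags. The sketch below is an assumption-laden heuristic, not a complete detector: `detect_style` and `audit` are hypothetical helpers, the indent width is read from the second line only, and anything that does not parse as a single JSON value is reported as `invalid`.

```python
import json
from collections import Counter

FENCE = "`" * 3  # markdown code fence

def detect_style(response: str) -> str:
    """Return a coarse style tag such as 'markdown+indent2' or 'single-line'."""
    text = response.strip()
    parts = []
    if text.startswith(FENCE):
        parts.append("markdown")
        # Drop the opening fence line and, if present, the closing fence.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    lines = text.splitlines()
    if len(lines) == 1:
        parts.append("single-line")
    else:
        # Leading spaces on the first nested line give the indent width.
        width = len(lines[1]) - len(lines[1].lstrip(" "))
        parts.append(f"indent{width}")
    return "+".join(parts)

def audit(responses):
    """Count how many responses fall under each style tag."""
    return Counter(detect_style(r) for r in responses)
```

A dataset that follows a single convention should produce exactly one tag; any additional tags, or any `invalid` entries, point to examples that still need conversion or review.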

Another subtle issue is numeric representation. Using raw float or int values can omit essential units (e.g., "million $"), reducing accuracy for financial tasks. To preserve units, add an explicit unit field in the JSON output, as shown below:

{
  "2018": {
    "net_profit": 5678,
    "unit": "million$"
  }
}
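As a rough illustration of how such a record might be produced, the sketch below splits a raw amount string into a number and a unit. The input shape (`"<number> <unit>"`), the `split_amount` helper, and the choice to strip spaces from the unit are assumptions made for this example; real financial text will need more cases.

```python
import re

# Assumed input shape: an optional sign, digits with optional thousands
# separators and decimals, then free-form unit text, e.g. "5,678 million $".
_AMOUNT = re.compile(r"^\s*(-?\d[\d,]*(?:\.\d+)?)\s*(.*?)\s*$")

def split_amount(raw: str, field: str = "net_profit") -> dict:
    """Split a raw amount string into a numeric value and a unit string."""
    match = _AMOUNT.match(raw)
    if match is None:
        raise ValueError(f"not a recognizable amount: {raw!r}")
    number, unit = match.groups()
    number = number.replace(",", "")
    value = float(number) if "." in number else int(number)
    # Spaces are removed so "million $" and "million$" map to one spelling.
    return {field: value, "unit": unit.replace(" ", "")}

print(split_amount("5,678 million $"))
```

Storing the unit as its own field keeps the number machine-readable while preserving the scale the model would otherwise lose.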

Additional challenges include escaping dollar signs in LaTeX expressions, handling Chinese punctuation, and deciding whether to insert commas in large numbers. These details often require bulk text processing scripts.
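A bulk-processing script for these details might look like the sketch below. All three helpers are hypothetical, and each encodes a policy choice rather than a rule: which punctuation to map, whether commas in numbers are removed or kept, and whether dollar signs should be escaped at all depend on the convention chosen for the dataset. Note that the comma stripper cannot tell a thousands separator from a list such as `1,234` meaning two items, so its matches still deserve review.

```python
import re

# Full-width (Chinese) punctuation mapped to ASCII equivalents.
_PUNCT = str.maketrans({"，": ",", "：": ":", "；": ";", "（": "(", "）": ")"})

# A comma between digits that is followed by exactly three digits.
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")

# A dollar sign that is not already preceded by a backslash.
_BARE_DOLLAR = re.compile(r"(?<!\\)\$")

def normalize_punct(text: str) -> str:
    """Replace full-width punctuation with its ASCII counterpart."""
    return text.translate(_PUNCT)

def strip_thousands(text: str) -> str:
    """Remove thousands separators, e.g. '1,234,567' becomes '1234567'."""
    return _THOUSANDS.sub("", text)

def escape_dollars(text: str) -> str:
    """Escape bare dollar signs so they are not read as LaTeX math delimiters."""
    return _BARE_DOLLAR.sub(r"\\$", text)
```

Order matters when the helpers are combined: running `normalize_punct` first turns a full-width comma inside a number into an ASCII comma, which `strip_thousands` will then remove.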

In summary, the most critical lesson is not the specific JSON format but the strict consistency of that format across the entire SFT dataset.

