Baobao Algorithm Notes
Nov 13, 2024 · Artificial Intelligence
Why Cleaning SFT Data Is a Nightmare: Hidden JSON Formatting Pitfalls
Cleaning SFT data for LLMs is surprisingly complex. Subtle JSON formatting variations, inconsistent markdown wrappers, intent settings, and unit handling can all make model outputs inconsistent, so producing reliable training data requires unified standards, careful prompt design, and extensive manual review.
JSON formatting · LLM data cleaning · Model Training
8 min read
