Does Synthetic Data Have a Future? Evidence‑Based Conclusions
A detailed audit of two public programming‑training datasets finds that purely AI‑generated synthetic data suffers from severe quality problems, and that even an AI‑plus‑expert‑review pipeline yields only about ten percent usable examples. The evidence indicates that high‑quality training data still requires domain experts combined with rigorous quality‑control processes.
Background
When large language models became capable of generating data, many asked whether manual annotation was still necessary. Model distillation, in which a strong model generates training data for a weaker one, has indeed boosted production speed, but the author argues that the conclusion that human annotation is obsolete does not hold up under scrutiny.
Industry Context
The author points out that companies such as Scale AI are not simple outsourcing firms: their value lies in scientific management pipelines, expert recruitment, and task‑evaluation systems that turn scattered human knowledge into core AI assets. Other data‑labeling unicorns command similar valuations, indicating a broader industry belief that high‑quality, human‑curated data will be a decisive factor in the competition for AGI.
Empirical Study
A research team from the "Intelligent Knowledge" AI‑data‑labeling group sampled two publicly available programming‑training datasets: SETA (released by Camel‑AI) and Terminal‑Bench Pro (open‑sourced by Alibaba). They selected problems marked as hard and performed a three‑step review:
1. Check whether the reference answer passes its own tests (oracle verification); a minimal sketch of this check follows the list.
2. Conduct a static code review.
3. Use a large model to solve the problem dynamically and analyze its reasoning trace.
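As a minimal sketch of the oracle‑verification step: run each item's own test suite against its own reference answer inside the item's declared Docker environment and flag anything that fails. The field names below (image, ref_solution_dir, test_cmd) and the dataset.jsonl layout are illustrative assumptions, not the datasets' actual schema.

```python
import json
import subprocess
from pathlib import Path

def oracle_verify(item: dict) -> bool:
    """Return True if an item's reference answer passes the item's own tests.

    Hypothetical item fields (not the datasets' real schema):
      image            - Docker image the task declares
      ref_solution_dir - directory holding the reference answer
      test_cmd         - shell command that runs the task's test suite
    """
    try:
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "-v", f"{Path(item['ref_solution_dir']).resolve()}:/app",
                item["image"],
                "sh", "-c", item["test_cmd"],
            ],
            capture_output=True,
            text=True,
            timeout=600,
        )
    except subprocess.TimeoutExpired:
        return False  # a reference answer that hangs also fails the oracle
    return result.returncode == 0

if __name__ == "__main__":
    items = [json.loads(line) for line in open("dataset.jsonl")]
    failing = [it["id"] for it in items if not oracle_verify(it)]
    print(f"{len(failing)}/{len(items)} items fail their own tests: {failing}")
```

This single check already catches failures like the broken Dockerfile case described below, because a reference answer that cannot even build its own environment can never pass.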
The results were strikingly similar for both datasets, with roughly nine‑tenths of the items exhibiting serious quality problems.
Problem Types and Frequencies
"Standard answer" is wrong : ~20% (SETA) vs ~10% (Terminal‑Bench Pro).
No training value : ~35% (SETA) vs ~45% (Terminal‑Bench Pro).
Problem‑test mismatch : ~35% for both.
Barely usable : ~10% for both.
The dominant issues are fake or overly easy problems (the "no training value" category) and test distortion (the "problem‑test mismatch" category), which together account for 70‑80% of the failures.
Concrete Failure Cases
Examples from the audit include:
Harbor‑Dataset/25: the generated Dockerfile references a non‑existent sample.log, causing the container build to fail.
Harbor‑Dataset/82: the task asks for a command‑line tool, but the test script hard‑codes the filename /app/pkg_resolver.py, penalizing correct solutions that use a different name.
Harbor‑Dataset/836: the specification requires a --dry‑run flag, yet the test suite never checks it, allowing a completely missing implementation to receive full credit.
Harbor‑Dataset/849: the bug cause is spelled out in the problem description, so the model can simply copy the answer.
build‑python‑sokoban‑solver (Terminal‑Bench Pro): a copy‑paste error makes the right‑direction mapping identical to the down‑direction, and the test only validates one map, letting the flawed solution pass (illustrated in the sketch after this list).
build‑nginx‑1‑24‑production‑server: the test sends only five requests and counts any 200/503 response as passing, so a server without real rate‑limiting can still score full marks.
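To make the Sokoban case concrete, the snippet below is a hypothetical reconstruction of the kind of copy‑paste bug the auditors describe, not the actual Terminal‑Bench Pro code. Because "right" duplicates "down", any test map that never requires a genuinely distinct right move still gets solved, so a single‑map test cannot expose the bug.

```python
# Hypothetical reconstruction of the reported copy-paste bug;
# (row, col) offsets for each move direction.
DIRECTIONS_BUGGY = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (1, 0),   # copy-paste error: identical to "down"
}

DIRECTIONS_FIXED = {
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

def move(pos, direction, table):
    """Apply a direction offset from the given table to a (row, col) position."""
    dr, dc = table[direction]
    return (pos[0] + dr, pos[1] + dc)

# A test suite that only exercises one map may never demand a genuine
# right move, so the buggy table passes every check it actually runs:
assert move((0, 0), "down", DIRECTIONS_BUGGY) == (1, 0)
# Only a test that requires a distinct right move exposes the flaw:
assert move((0, 0), "right", DIRECTIONS_FIXED) == (0, 1)
assert move((0, 0), "right", DIRECTIONS_BUGGY) != (0, 1)  # the bug
```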
Pure AI Limitations
The audit shows that AI‑generated data often suffer from three major shortcomings:
Environment incompatibility – missing files or incorrect Dockerfile references.
Unreliable tests – reward signals that are either wrong or absent, turning training into noise.
Trivial problems – answers are embedded in the prompt, so the model merely copies without reasoning.
Consequently, synthetic data cannot exceed the capabilities of the model that generated it.
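Closing the loop on the unreliable‑tests point: the Nginx case passed because the test accepted any 200/503 response. A less gameable check must require both sides of the behaviour, that in‑limit requests succeed and that excess requests are actually rejected. The sketch below assumes a hypothetical local server and declared rate limit; the URL, limit, and slack margin are all illustrative.

```python
import urllib.error
import urllib.request

URL = "http://localhost:8080/"  # hypothetical server under test
LIMIT = 10                      # hypothetical declared cap, requests/second
BURST = 50                      # deliberately exceed the cap

def status(url: str) -> int:
    """Return the HTTP status code, including error codes like 429/503."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code

# Note: a serious test would fire the burst concurrently so it lands
# inside one limiter window; sequential requests keep the sketch short.
codes = [status(URL) for _ in range(BURST)]
served = codes.count(200)
rejected = sum(code in (429, 503) for code in codes)

# A server with no rate limiting returns 200 for everything, so the weak
# "any 200/503 counts" criterion can never fail it. Demand both sides of
# the behaviour instead (with some slack for timing jitter):
assert served >= 1, "server never serves in-limit traffic"
assert rejected >= BURST - LIMIT - 5, "excess traffic was not throttled"
print(f"{served} served, {rejected} rejected out of {BURST} requests")
```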
Human‑in‑the‑Loop Does Not Solve All Issues
Adding expert review (AI + expert) improves the dataset only marginally; the qualified rate remains around ten percent. Two deeper problems emerge:
Expert management is required – without strict processes, even simple mistakes slip through, as seen in the Sokoban direction error.
Domain expertise is essential – only specialists can spot hidden flaws such as the ineffective Nginx rate‑limiting test.
Simple expert checks alone are insufficient; organized workflows and domain‑specific knowledge are also critical.
Conclusion
The study demonstrates that high‑quality programming training data can be produced neither by fully automated AI synthesis nor by a cursory human glance at the output. It requires "domain experts + rigorous quality‑management processes". Relying on cheap synthetic data creates a brittle foundation that can quickly erode trust in real‑world applications.
For readers interested in data‑labeling, the original research and dataset links are provided in the appendix.