Boost Large‑Model Fine‑Tuning with Low‑Cost Data Selection and Construction
This article explains practical techniques for selecting and constructing fine‑tuning data for large language models. It covers three approaches: improving data diversity through similarity‑based clustering, semi‑supervised filtering with binary classifiers, and uncertainty‑driven sampling using perplexity or reward models. Together these form an efficient, low‑cost data pipeline.
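
To make the diversity idea concrete, here is a minimal sketch in Python. It is not the article's implementation: it swaps full clustering for a simpler greedy similarity filter, and it assumes each training sample has already been turned into an embedding vector by some model. The names `cosine`, `select_diverse`, and the `threshold` value are hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_diverse(embeddings, threshold=0.9):
    """Greedily keep the indices of samples that are not near-duplicates.

    A sample is kept only if its similarity to every already-kept
    sample is below `threshold` (an assumed, tunable cutoff).
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the second vector nearly duplicates the first.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
selected = select_diverse(toy, threshold=0.9)
print(selected)
```

The greedy pass compares each candidate against every kept sample, so its cost grows quadratically in the worst case. For large datasets, a clustering step or an approximate nearest-neighbor index would likely be a better fit.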
