How to Choose High-Quality Instruction Data for LLM Fine‑Tuning: Methods Compared
This article surveys and categorizes instruction data selection techniques for large language model fine‑tuning, explaining metric‑based, trainable‑LLM, powerful‑LLM, and small‑model approaches, detailing representative papers, their pipelines, and empirical findings on data quality and diversity.
Problem Setting
Given a dataset X = {x₁, x₂, …, xₙ} of n instruction fine‑tuning examples, a selection method π chooses a subset S(m) ⊆ X of size m ≤ n. The quality of S(m) is measured by a predefined evaluation metric Q, which serves as the common yardstick for comparing selection strategies.
Four Broad Categories of Selection Methods
Metric‑Based Methods
Trainable‑LLM Methods
Powerful‑LLM Methods
Small‑Model Methods
Metric‑Based Methods
These approaches define a set of k explicit metrics I₁, I₂, …, Iₖ (e.g., instruction length, perplexity, reward score, KNN‑i) and compute a score for each metric and each instance: scoreᵢⱼ = Iᵢ(xⱼ). Aggregating the scores of each instance gives a combined quality measure that is used to rank or filter the data.
After scores are computed, a threshold can be set to include only high‑scoring instances.
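The scoring and thresholding steps above can be sketched as follows. This is a minimal illustration, not any paper's implementation: the two length metrics, the weights, and the threshold are hypothetical stand-ins, and a real system would plug in perplexity, reward-model scores, and similar signals.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Metric = Callable[[Example], float]

# Hypothetical metrics chosen only because they need no model to compute.
def instruction_length(example: Example) -> float:
    return float(len(example["instruction"].split()))

def response_length(example: Example) -> float:
    return float(len(example["response"].split()))

def score_matrix(data: List[Example], metrics: List[Metric]) -> List[List[float]]:
    # scores[i][j] = I_i(x_j), matching the notation in the text.
    return [[metric(x) for x in data] for metric in metrics]

def filter_by_threshold(data: List[Example], metrics: List[Metric],
                        weights: List[float], threshold: float) -> List[Example]:
    scores = score_matrix(data, metrics)
    kept = []
    for j, x in enumerate(data):
        # Weighted sum is one simple way to aggregate the per-metric scores.
        total = sum(w * scores[i][j] for i, w in enumerate(weights))
        if total >= threshold:
            kept.append(x)
    return kept

data = [
    {"instruction": "Translate to French", "response": "Bonjour"},
    {"instruction": "Explain how binary search works step by step",
     "response": "Binary search halves the search interval each step until the target is found"},
]
selected = filter_by_threshold(
    data, [instruction_length, response_length], [1.0, 1.0], threshold=10.0
)
```

With these toy inputs only the second example passes the threshold. In practice the weights and the threshold would be tuned against the evaluation metric Q rather than set by hand.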
[Cao et al., 2023] INSTRUCTMINING builds a linear model over the above metrics. Parameters are estimated via least‑squares regression on fine‑tuning experiments that correlate metric values with downstream loss.
[Wei et al., 2023] InstructionGPT‑4 combines CLIP scores, instruction length, and dimensionality‑reduced embeddings as metrics, feeds the resulting vector into a trainable selector (MLP or self‑attention), and clusters data to assign quality labels.
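The least‑squares idea behind the INSTRUCTMINING entry above can be illustrated with a small sketch. All numbers here are made up for the example: each row stands for one hypothetical fine‑tuning experiment, each column for the average value of one metric over the subset used in that experiment, and `eval_loss` for the loss observed afterwards.

```python
import numpy as np

# Hypothetical metric averages, one row per fine-tuning experiment.
metric_values = np.array([
    [0.2, 1.0],
    [0.4, 0.5],
    [0.6, 0.8],
    [0.8, 0.1],
])
# Hypothetical downstream losses, generated as 2.0 - 1.0*m1 + 0.5*m2.
eval_loss = np.array([2.3, 1.85, 1.8, 1.25])

# Add an intercept column, then solve the least-squares problem.
design = np.column_stack([np.ones(len(metric_values)), metric_values])
coef, *_ = np.linalg.lstsq(design, eval_loss, rcond=None)

def predicted_loss(metrics_for_subset: np.ndarray) -> float:
    # Lower predicted loss suggests a more promising candidate subset.
    return float(coef[0] + metrics_for_subset @ coef[1:])
```

Once the coefficients are fitted, candidate subsets can be ranked by `predicted_loss` without running a fine‑tuning job for each one, which is the practical appeal of fitting a cheap surrogate model.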
Trainable‑LLM Methods
Large language models themselves act as trainable selectors, scoring each instruction after a brief fine‑tuning phase.
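[Li et al., 2023a] IFD first fine‑tunes an LLM on a small clustered subset, then scores every example with the "Instruction‑Following Difficulty" (IFD) metric: the ratio of the model's loss on the response given the instruction to its loss on the response alone. A high ratio means the instruction gives the model little help in producing the response, so the example is treated as more informative.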
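As a rough sketch, the IFD score can be read as a ratio of two average losses: the loss on the response tokens when the instruction is in context, divided by the loss on the same tokens without it. The per‑token log‑probabilities below are invented inputs; in practice they would come from the briefly fine‑tuned model, and the function names are hypothetical.

```python
from typing import Sequence

def mean_nll(token_logprobs: Sequence[float]) -> float:
    # Average negative log-likelihood over the response tokens.
    return -sum(token_logprobs) / len(token_logprobs)

def ifd_score(cond_logprobs: Sequence[float],
              uncond_logprobs: Sequence[float]) -> float:
    # cond_logprobs: log p(response token | instruction, previous tokens)
    # uncond_logprobs: log p(response token | previous tokens only)
    return mean_nll(cond_logprobs) / mean_nll(uncond_logprobs)

# Made-up log-probabilities for a three-token response.
with_instruction = [-0.5, -0.5, -1.0]
without_instruction = [-1.0, -1.0, -2.0]
score = ifd_score(with_instruction, without_instruction)
```

Here the instruction halves the loss, so the ratio is about 0.5. A ratio close to 1 would indicate that the instruction barely helps the model, which is the situation the metric is meant to flag.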
[Li et al., 2023b] Instruction Back‑Translation fine‑tunes a base model on seed (instruction, output) pairs, uses it to generate candidate instructions for unlabeled text, then lets the model score each candidate pair; those above a threshold form a high‑quality subset.
[Li et al., 2023c] Nuggets Framework first scores the model zero‑shot on a set of predefined tasks, then repeats the scoring one‑shot with each instruction example as the demonstration; comparing the two yields a "golden score" used for ranking.
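One simple way to turn the zero‑shot versus one‑shot comparison into a single number is to count the fraction of predefined tasks on which the one‑shot score beats the zero‑shot score. The sketch below assumes that reading; the per‑task scores are invented, and the function name is hypothetical.

```python
from typing import Sequence

def golden_score(zero_shot: Sequence[float], one_shot: Sequence[float]) -> float:
    # zero_shot[j]: model score on task j with no demonstration.
    # one_shot[j]: score on task j when the candidate example is the demonstration.
    if len(zero_shot) != len(one_shot):
        raise ValueError("both score lists must cover the same tasks")
    improved = sum(1 for z, o in zip(zero_shot, one_shot) if o > z)
    return improved / len(zero_shot)

# Made-up scores on four predefined tasks.
zero = [0.40, 0.55, 0.30, 0.70]
one = [0.45, 0.50, 0.35, 0.70]
gs = golden_score(zero, one)
```

In this toy case the candidate helps on two of four tasks. Candidates would then be ranked by this value and the top ones kept.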
[Wu et al., 2023] DIVERSEEVOL iteratively selects data using LLaMA embeddings and a k‑center‑greedy algorithm to maximize diversity across iterations.
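The k‑center‑greedy step mentioned above can be sketched in a few lines. This is the generic farthest‑point heuristic rather than any paper's exact code; the toy 2‑D points stand in for model embeddings, and starting from index 0 is an arbitrary choice.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, first: int = 0) -> list:
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of points")
    selected = [first]
    # Distance from every point to its nearest already-selected center.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < k:
        # Pick the point that is farthest from all current centers.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy embeddings: two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
chosen = k_center_greedy(points, k=3)
```

The selection picks one point from each region and skips near‑duplicates, which is why the heuristic is popular for diversity‑oriented selection.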
Powerful‑LLM Methods
State‑of‑the‑art LLMs (e.g., GPT‑4, ChatGPT) are used as zero‑shot evaluators via carefully crafted prompts.
[Chen et al., 2023b] ALPAGASUS prompts ChatGPT to assess each (instruction, input, response) tuple, filtering out low‑quality items before fine‑tuning.
[Lu et al., 2023] INSTAG asks ChatGPT to generate detailed open‑ended tags for each instruction, then selects a subset by maximizing tag diversity and complexity.
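A simplified sketch of tag‑based selection follows. It assumes tags have already been produced by the LLM, treats the number of tags as a proxy for complexity, and keeps an example only if it adds at least one tag not yet covered. The example IDs and tags are invented, and the actual INSTAG procedure may differ in its details.

```python
from typing import List, Set, Tuple

def select_by_tags(tagged: List[Tuple[str, Set[str]]], budget: int) -> List[str]:
    # Visit examples with more tags first (complexity proxy).
    ordered = sorted(tagged, key=lambda item: len(item[1]), reverse=True)
    covered: Set[str] = set()
    chosen: List[str] = []
    for example_id, tags in ordered:
        if len(chosen) >= budget:
            break
        # Keep the example only if it widens tag coverage (diversity).
        if tags - covered:
            chosen.append(example_id)
            covered |= tags
    return chosen

tagged = [
    ("a", {"math", "algebra"}),
    ("b", {"math"}),
    ("c", {"translation", "french", "formal"}),
    ("d", {"coding"}),
]
picked = select_by_tags(tagged, budget=3)
```

Example "b" is skipped because its only tag is already covered by "a", so the budget goes to examples that broaden coverage.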
Small‑Model Methods
External lightweight models serve as scorers or embedder‑based selectors.
[Du et al., 2023] MoDS combines three criteria—quality (via a reward model), coverage (k‑center‑greedy seed selection), and necessity (impact on LLM fine‑tuning)—through a four‑step pipeline that iteratively refines the data subset.
[Chen et al., 2023a] Coreset‑based Selection extracts embeddings with a pretrained model (e.g., BERT), clusters them without supervision, and applies the K‑Center‑Greedy algorithm to pick representative core samples.
Other Notable Approaches
[Kung et al., 2023] Active Instruction Tuning measures "Prompt Uncertainty": it randomly deletes words from an instruction to generate k perturbed versions, then measures how much the LLM's output probabilities vary across them; tasks with higher uncertainty are prioritized for fine‑tuning.
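The perturb‑and‑compare idea can be sketched as below. Several things are assumptions of this sketch: `likelihood_fn` is a hypothetical callable that returns the model's probability of its original prediction under a given instruction, the spread is summarized as a mean absolute difference from the unperturbed value, and `toy_likelihood` is only a stand‑in so the code runs without a model.

```python
import random
from typing import Callable

def perturb(instruction: str, drop_rate: float, rng: random.Random) -> str:
    # Randomly delete words; fall back to the original if everything is dropped.
    words = instruction.split()
    kept = [w for w in words if rng.random() >= drop_rate]
    return " ".join(kept) if kept else instruction

def prompt_uncertainty(instruction: str,
                       likelihood_fn: Callable[[str], float],
                       k: int = 20, drop_rate: float = 0.2, seed: int = 0) -> float:
    rng = random.Random(seed)
    base = likelihood_fn(instruction)
    diffs = [
        abs(base - likelihood_fn(perturb(instruction, drop_rate, rng)))
        for _ in range(k)
    ]
    return sum(diffs) / k

def toy_likelihood(text: str) -> float:
    # Stand-in for a real model call: longer prompts score higher, capped at 1.
    return min(1.0, len(text.split()) / 10.0)

u = prompt_uncertainty("Summarize the following article in two sentences", toy_likelihood)
```

A model whose output probability does not move under perturbation gets an uncertainty of zero; larger values mark tasks where small wording changes shift the model's behavior.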
[Xu et al., 2023b] LIFT expands dataset diversity with ChatGPT‑generated instructions, then selects a subset based on row‑variance and scores (accuracy, explainability, clarity, difficulty, length).
[Liu et al., 2023] DEITA assigns each instruction a complexity score (c) and quality score (q) using a specialized complexity scorer and a quality evaluator; the product c·q yields a composite score for ranking and diversity‑aware subset construction.
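The score‑then‑diversify idea can be sketched as follows: rank examples by c·q, then walk down the ranking and skip any example whose embedding is too similar to one already chosen. The scores, embeddings, and the `max_sim` threshold are invented for illustration, and the real pipeline's scorers and embedding model are not reproduced here.

```python
import numpy as np

def select_score_first(complexity, quality, embeddings: np.ndarray,
                       budget: int, max_sim: float = 0.9) -> list:
    scores = np.asarray(complexity, dtype=float) * np.asarray(quality, dtype=float)
    # Normalize rows so dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = []
    for idx in np.argsort(-scores):  # highest composite score first
        if len(chosen) >= budget:
            break
        # Skip near-duplicates of anything already selected.
        if chosen and float(np.max(unit[chosen] @ unit[idx])) >= max_sim:
            continue
        chosen.append(int(idx))
    return chosen

complexity = [5, 4, 3, 2]
quality = [5, 5, 4, 5]
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]])
subset = select_score_first(complexity, quality, embeddings, budget=2)
```

Example 1 has the second‑highest score but is nearly identical to example 0 in embedding space, so the second slot goes to example 2 instead.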
[Zhao et al., 2023] Tree‑Instruct builds semantic parse trees with GPT‑4, uses node count as a complexity metric, augments trees to increase complexity, and converts them back to natural‑language instructions.
[Yu et al., 2023] WaveCoder filters code‑centric LLM data using a GPT‑4‑based discriminator that evaluates multiple sub‑criteria, ensuring fine‑grained control over instruction quality.
Overall, these methods illustrate a spectrum of strategies—from simple handcrafted metrics to sophisticated LLM‑driven evaluators—each balancing data quality, diversity, and computational cost for effective instruction fine‑tuning.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
