Boost Large‑Model Fine‑Tuning with Low‑Cost Data Selection and Construction

The article explains practical techniques for selecting and constructing fine‑tuning data for large language models: ensuring data diversity through similarity‑based deduplication and clustering, semi‑supervised filtering with a binary classifier, and uncertainty‑driven sampling based on perplexity, gated by a reward model for quality, to build an efficient, low‑cost pipeline.

Baobao Algorithm Notes

Dec 11, 2023

Problem Context

Fine‑tuning a large language model depends heavily on which training examples are chosen. Real‑world corpora often follow a long‑tail distribution: a few classes account for most samples, while many classes are severely under‑represented. Annotating randomly sampled scraped text therefore reproduces that imbalance and spends labeling budget on redundant examples, which leads to sub‑optimal model performance.

Active‑Learning Principles

Two classic active‑learning criteria guide data selection:

Data diversity: ensure the dataset covers a wide range of semantic regions.

Model uncertainty: prioritize examples the current model finds difficult.

1. Ensuring Data Diversity

Deduplication via similarity measurement

Compute a similarity score between text pairs and remove near‑duplicates. Common similarity encoders include:

Semantic vectors from contrastive learning (e.g., sentence‑BERT, SimCSE).

Bag‑of‑words or TF‑IDF cosine similarity for lightweight pipelines.
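The lightweight option above can be sketched in a few lines. This is a minimal illustration, assuming raw word counts instead of TF‑IDF weights and a similarity threshold of 0.8; the function names and the threshold are illustrative choices, not something the article prescribes. A real pipeline would likely swap `bow_vector` for a sentence encoder.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept: list[tuple[str, Counter]] = []
    for text in texts:
        vec = bow_vector(text)
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]


corpus = [
    "How do I reset my password?",
    "How do I reset my password",
    "What is the refund policy for annual plans?",
]
# The second text differs only by punctuation, so it is dropped.
unique_texts = drop_near_duplicates(corpus)
```

The pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch but would need an approximate nearest‑neighbor index on a large corpus.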

After obtaining similarity scores, apply one of the following clustering strategies:

One‑pass clustering: iterate through the corpus once; assign each sample to an existing cluster if its similarity to that cluster's centroid exceeds a threshold, otherwise start a new cluster.

K‑Center‑Greedy: iteratively select the point whose minimum distance to the already chosen points is largest, yielding a small subset that still covers the data space.
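Both strategies can be sketched on toy 2‑D vectors. The code below is an illustration under stated assumptions: cosine similarity for one‑pass clustering, Euclidean distance for K‑Center‑Greedy, and an arbitrary starting point (`first=0`). None of these choices are specified by the article, and the function names are hypothetical.

```python
import math

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def one_pass_cluster(vectors: list[Vector], threshold: float) -> list[list[int]]:
    """Single pass: join the most similar cluster if similarity to its centroid
    exceeds `threshold`, otherwise open a new cluster."""
    clusters: list[list[int]] = []  # member indices per cluster
    sums: list[Vector] = []         # running vector sum per cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = -1, threshold
        for c, total in enumerate(sums):
            # Cosine is scale-invariant, so the running sum stands in for the mean.
            sim = cosine(vec, total)
            if sim > best_sim:
                best, best_sim = c, sim
        if best < 0:
            clusters.append([idx])
            sums.append(list(vec))
        else:
            clusters[best].append(idx)
            sums[best] = [s + v for s, v in zip(sums[best], vec)]
    return clusters


def k_center_greedy(vectors: list[Vector], k: int, first: int = 0) -> list[int]:
    """Greedily pick up to `k` indices, each maximizing the distance to its nearest pick."""
    chosen = [first]
    # min_dist[i] = distance from point i to the closest already chosen point
    min_dist = [math.dist(v, vectors[first]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # every remaining point duplicates a chosen one
        chosen.append(nxt)
        min_dist = [min(d, math.dist(v, vectors[nxt])) for d, v in zip(min_dist, vectors)]
    return chosen


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
clusters = one_pass_cluster(points, threshold=0.9)  # [[0, 1], [2, 3], [4]]
diverse = k_center_greedy(points, k=3)              # [0, 4, 2]
```

One‑pass clustering depends on the order in which samples arrive, while K‑Center‑Greedy depends on the starting point; both are cheap heuristics rather than optimal solutions.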

Semi‑supervised filtering for novel data

If a high‑quality curated set C already exists, treat C as positive (label 1) and the remaining pool P as negative (label 0), then train a binary classifier (e.g., DeBERTa) with K‑fold cross‑validation. In each fold, predict probabilities on the held‑out split, so that every sample is scored by a model that did not see it during training. Pool samples whose predicted probability stays close to 0 are the ones the classifier finds least similar to C; they carry information the curated set lacks and can be added to the training pool.
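The out‑of‑fold scoring loop can be illustrated without a deep model. The sketch below swaps DeBERTa for a tiny bag‑of‑words naive Bayes classifier purely so that it runs with the standard library; the classifier, the strided fold assignment, the `max_prob` cutoff of 0.1, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(texts, labels):
    """Tiny bag-of-words naive Bayes; a toy stand-in for a DeBERTa classifier."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def prob_positive(model, text):
    """P(label = 1 | text) with add-one smoothing; unseen tokens are ignored."""
    counts, docs, vocab = model
    log_score = {}
    for label in (0, 1):
        total = sum(counts[label].values()) + len(vocab)
        log_score[label] = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for token in text.lower().split():
            if token in vocab:
                log_score[label] += math.log((counts[label][token] + 1) / total)
    diff = min(log_score[0] - log_score[1], 700.0)  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(diff))


def out_of_fold_scores(texts, labels, k=5):
    """Score every sample with a model that never saw it during training."""
    scores = [0.0] * len(texts)
    for fold in range(k):
        # Strided folds; with positives and negatives stored contiguously this
        # keeps the class ratio roughly equal in every fold.
        held_out = set(range(fold, len(texts), k))
        train_idx = [i for i in range(len(texts)) if i not in held_out]
        model = train_nb([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in held_out:
            scores[i] = prob_positive(model, texts[i])
    return scores


def novel_pool_samples(curated, pool, max_prob=0.1, k=5):
    """Pool samples the classifier confidently scores as unlike the curated set."""
    texts = curated + pool
    labels = [1] * len(curated) + [0] * len(pool)
    scores = out_of_fold_scores(texts, labels, k=k)
    return [t for t, s in zip(pool, scores[len(curated):]) if s < max_prob]


curated = ["sort python list", "python list sort", "list sort python"] * 2
pool = [
    "python sort list", "list python sort", "sort list python",
    "bake bread oven flour yeast", "oven bake flour yeast bread",
    "yeast flour bread oven bake",
]
novel = novel_pool_samples(curated, pool, max_prob=0.1, k=3)
```

On this toy data only the three baking texts fall below the cutoff, because the pool texts about Python look like the curated set and receive a high positive probability.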

2. Targeting Model Uncertainty

Per‑token perplexity (PPL)

Run the current model on each candidate example and compute its perplexity (PPL), i.e. the exponential of the average per‑token negative log‑likelihood. Higher PPL indicates lower confidence. For instruction‑style multiple‑choice data (a question plus several answer options), sum the probabilities the model assigns to the answer‑option tokens; a low total signals uncertainty.
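Both signals reduce to simple arithmetic once the model's token probabilities are available. The sketch below assumes the log‑probabilities and the next‑token distribution have already been extracted from the model; the numbers and function names are made up for illustration.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens of one example."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def option_probability_mass(next_token_probs: dict[str, float], options: list[str]) -> float:
    """Total probability the model puts on the answer-option tokens (e.g. A to D)."""
    return sum(next_token_probs.get(option, 0.0) for option in options)


# Hypothetical natural-log probabilities of each token under the current model.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
unsure = [math.log(0.2), math.log(0.1), math.log(0.3)]
ppl_confident = perplexity(confident)  # close to 1: the model predicts the text well
ppl_unsure = perplexity(unsure)        # much larger: the model is surprised

# Hypothetical next-token distribution after a multiple-choice prompt.
probs = {"A": 0.05, "B": 0.10, "C": 0.04, "D": 0.06, "The": 0.40}
mass = option_probability_mass(probs, ["A", "B", "C", "D"])  # about 0.25
```

A uniform guess over N tokens gives a perplexity of exactly N, which is a handy sanity check when wiring this to a real model.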

Quality‑aware uncertainty via a reward model

Uncertainty alone may select low‑quality noise. To guard against this, build a reward model, here a binary quality classifier (e.g., DeBERTa, the same architecture as the filter above), trained on a small labeled subset of high‑quality instruction data. The reward model outputs a quality score q ∈ [0, 1].

Final selection criterion:

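# Pseudocode: model, reward_model and the three thresholds are placeholders
selected = []
for example in candidate_set:
    # Uncertainty signal for free-form text
    ppl = model.perplexity(example)
    # Uncertainty signal for multiple-choice data: total probability on the option tokens
    prob_sum = sum(model.probabilities(example.answer_options))
    # Quality score q in [0, 1] from the reward model
    quality = reward_model.predict(example)
    is_uncertain = ppl > ppl_threshold or prob_sum < prob_threshold
    if is_uncertain and quality > quality_threshold:
        selected.append(example)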

This combines "model uncertainty" with a "quality‑above‑threshold" filter, analogous to manual rejection sampling but fully automated.

3. End‑to‑End Data Construction Pipeline

Collect raw text from the target domain.

Deduplicate using similarity‑based clustering (one‑pass or K‑Center‑Greedy).

If a curated seed set exists, run the semi‑supervised binary classifier to extract novel, dissimilar examples.

Evaluate remaining candidates with the current model to obtain PPL or answer‑option probabilities.

Score candidates with the reward model to filter out low‑quality uncertain samples.

Aggregate the selected examples as the fine‑tuning dataset.
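The steps above can be wired together roughly as follows. This is only a structural sketch: each stage is passed in as a callable, and the lambdas in the usage example are toy stand‑ins (exact‑match dedup, a fake "PPL" based on word count, a constant quality score), not real components. All names and thresholds are hypothetical.

```python
from typing import Callable, Iterable


def build_finetuning_set(
    raw_texts: Iterable[str],
    deduplicate: Callable[[list[str]], list[str]],
    is_novel: Callable[[str], bool],
    uncertainty: Callable[[str], float],
    quality: Callable[[str], float],
    uncertainty_threshold: float,
    quality_threshold: float,
) -> list[str]:
    """Chain the stages: dedup -> novelty filter -> uncertainty -> quality."""
    candidates = deduplicate(list(raw_texts))
    candidates = [t for t in candidates if is_novel(t)]
    uncertain = [t for t in candidates if uncertainty(t) > uncertainty_threshold]
    return [t for t in uncertain if quality(t) > quality_threshold]


# Toy stand-ins so the sketch runs; real stages would call the encoder,
# the binary classifier, the current model and the reward model.
selected = build_finetuning_set(
    raw_texts=["a b", "a b", "c d e", "f"],
    deduplicate=lambda texts: list(dict.fromkeys(texts)),  # exact-match dedup
    is_novel=lambda text: text != "f",
    uncertainty=lambda text: float(len(text.split())),     # fake "PPL"
    quality=lambda text: 0.9,
    uncertainty_threshold=2.5,
    quality_threshold=0.5,
)
```

Ordering the cheap filters first means the expensive model calls (uncertainty and reward scoring) only run on candidates that survived deduplication and novelty filtering.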

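This pipeline is designed to reduce labeling cost and mitigate long‑tail imbalance by spending annotation effort only on examples that are novel, difficult for the current model, and of sufficient quality. The same recipe applies to most supervised fine‑tuning tasks.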
