Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

DaTaobao Tech

Aug 21, 2024

Background

Custom large‑language‑model (LLM) training enables organizations to meet domain‑specific requirements, keep data private, and improve efficiency.

Why Fine‑Tune?

Adapt functionality to particular tasks, increasing accuracy and speed.

Incorporate industry terminology (e.g., code generation, customer‑service dialogue).

Data Privacy

All data stays inside the organization, satisfying privacy regulations.

Reduces risk of external data leakage.

Model Transparency

Training your own model reveals internal mechanisms, aiding retrieval‑augmented generation, knowledge‑graph construction, function calling, and agent development.

Transparency: clearer insight into how data is processed and how decisions are made.

Explainability: essential for regulated or high‑risk scenarios.

Data Cleaning Pipeline

Typical steps for high‑quality internet data:

Language detection – identify each document's language and drop text outside the target language (for example, keep only English when preparing data for LLaMA).

Rule‑based filtering – remove sentences with excessive punctuation or prohibited words.

Scoring model – lightweight classifier to assess content quality.

Deduplication – discard near‑duplicate entries.

For dialogue data, left‑truncate conversation histories so that the most recent turns are kept.

Strip filler words (e.g., the Chinese interjections "嗯" ("um") and "啊" ("ah")).

Exclude irrelevant conversational turns (e.g., direct human hand‑off).

Enrich samples with user attributes such as age, gender, region.
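The rule-based filtering, deduplication, and left-truncation steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the blocklist, the punctuation threshold, and all function names are hypothetical, and hashing normalized text stands in for a real near-duplicate detector. Language detection and quality scoring are omitted because they need trained models.

```python
import hashlib
import re

BANNED_WORDS = {"spamword"}   # hypothetical blocklist
MAX_PUNCT_RATIO = 0.3         # hypothetical threshold


def passes_rules(text: str) -> bool:
    """Rule-based filter: reject empty text, heavy punctuation, or banned words."""
    if not text.strip():
        return False
    punct = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if punct / len(text) > MAX_PUNCT_RATIO:
        return False
    lowered = text.lower()
    return not any(word in lowered for word in BANNED_WORDS)


def dedup_key(text: str) -> str:
    """Hash of the lowercased text with punctuation and extra whitespace removed."""
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(texts):
    """Apply the rule filter, then drop texts whose normalized form was already seen."""
    seen, kept = set(), []
    for text in texts:
        if not passes_rules(text):
            continue
        key = dedup_key(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def left_truncate(history, max_turns: int):
    """Keep only the most recent `max_turns` dialogue turns."""
    return history[-max_turns:] if max_turns > 0 else []


cleaned = clean_corpus(["Hello, world.", "hello world", "!!!???!!!", "buy spamword now"])
recent = left_truncate(["turn1", "turn2", "turn3", "turn4"], max_turns=2)
```

In practice the exact-match hash would usually be replaced by a fuzzy method such as MinHash, which also catches documents that differ by a few words.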

Data Ratio

In the paper cited here as "垂域大模型训练" (vertical‑domain large‑model training), the best results came from a 4:1 mix: 80 % open‑source data and 20 % domain‑specific data. Reference: https://arxiv.org/pdf/2307.15290.pdf

For continued pre‑training, keep domain data below 15 % of the total to preserve general abilities (summarization, QA). The exact threshold varies with model size (e.g., LLaMA may require an even lower proportion).

For supervised fine‑tuning (SFT), a 1:1 domain‑to‑general data ratio often works well when the overall dataset remains manageable.
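One way to build a mix at a chosen ratio is to keep all domain samples and draw the matching number of general samples at random. The sketch below assumes in-memory lists; the function name `mix_by_ratio` and its parameters are hypothetical. The general count is rounded up so the domain share never exceeds the target.

```python
import random


def mix_by_ratio(domain, general, general_parts, domain_parts, seed=0):
    """Combine all domain samples with enough general samples for a
    general:domain ratio of general_parts:domain_parts, then shuffle."""
    # Ceiling division, so the domain share stays at or below the target.
    n_general = -(-len(domain) * general_parts // domain_parts)
    if n_general > len(general):
        raise ValueError("not enough general samples for the requested ratio")
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


domain = [("domain", i) for i in range(20)]
general = [("general", i) for i in range(200)]

four_to_one = mix_by_ratio(domain, general, general_parts=4, domain_parts=1)
one_to_one = mix_by_ratio(domain, general, general_parts=1, domain_parts=1)
domain_share = sum(1 for tag, _ in four_to_one if tag == "domain") / len(four_to_one)
```

With 20 domain samples, the 4:1 call yields 100 samples with a 20 % domain share, and the 1:1 call yields 40.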

Training Mechanics

Key factors that determine training time and resource consumption:

Token count = epochs × number of samples × average sample length × token‑conversion factor (tokens per unit of length).

Batch size – number of samples processed per GPU step.

Epochs – how many passes over the dataset.

Learning rate – when the batch size grows by a factor of k, scale the learning rate by roughly √k. Suggested starting points: ≈ 2e‑5 for full‑parameter fine‑tuning and ≈ 5e‑5 for LoRA.
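The token budget and the square-root learning-rate rule reduce to simple arithmetic. In this sketch the function names and the sample numbers (3 epochs, 10,000 samples, 500 characters each, 1.5 tokens per character) are illustrative assumptions, not values from the article.

```python
import math


def total_training_tokens(epochs, num_samples, avg_length, tokens_per_unit):
    """Token count = epochs x samples x average length x token-conversion factor."""
    return epochs * num_samples * avg_length * tokens_per_unit


def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Square-root rule: multiply the LR by sqrt(new_batch / base_batch)."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


tokens = total_training_tokens(epochs=3, num_samples=10_000,
                               avg_length=500, tokens_per_unit=1.5)
lr = scaled_learning_rate(base_lr=2e-5, base_batch_size=32, new_batch_size=128)
```

Here the budget comes to 22.5 million tokens, and quadrupling the batch size from 32 to 128 doubles the learning rate from 2e-5 to 4e-5.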

GPU Memory Breakdown (Full‑Parameter Training)

Memory is dominated by four components:

Model parameters.

Activations during the forward pass.

Gradients during the backward pass.

Optimizer states (e.g., AdamW first‑ and second‑moment buffers).

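Assuming mixed‑precision training with AdamW, one common accounting puts the cost at about 20 bytes per parameter: 2 bytes for the float16 weight, 2 bytes for the float16 gradient, 4 bytes each for float32 master copies of the weight and the gradient, and 2 × 4 bytes for the two float32 optimizer moments. Total memory for these states ≈ 20 × φ bytes, where φ is the number of parameters. Implementations that keep no float32 gradient copy need about 16 bytes per parameter. Activations come on top of this and grow with batch size and sequence length.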
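A quick estimate using the 20-bytes-per-parameter rule of thumb above is sketched below. The function names and the 80 GB card size are hypothetical, and activations are ignored, so real requirements will be higher.

```python
import math


def training_state_memory_gb(num_params: float, bytes_per_param: int = 20) -> float:
    """Memory for weights, gradients, and optimizer states, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on the GPU count if the states are sharded evenly (hypothetical 80 GB cards)."""
    return math.ceil(memory_gb / gpu_memory_gb)


mem_7b = training_state_memory_gb(7e9)   # 7 B parameters at 20 bytes each
gpus_7b = min_gpus(mem_7b)
```

For a 7 B-parameter model this gives 140 GB of training state, so at least two 80 GB cards even before activations are counted.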

LoRA Fine‑Tuning

LoRA adds low‑rank adapters (B × A) to selected weight matrices, most commonly the attention query and value projections (q_proj, v_proj). During training only the adapters are updated; the original weights remain frozen. For inference the adapters can be merged into the frozen weights, so the merged model incurs no extra latency.

Updates only ≈0.3‑0.5 % of parameters.

Reduces communication overhead in distributed training.

Leverages low‑precision acceleration for faster training.

Limitation: when data is abundant (>10 k samples) and compute is available, full‑parameter fine‑tuning may outperform LoRA.
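The merge property can be checked numerically. The sketch below uses plain Python lists and toy dimensions, all of them hypothetical, to show that the frozen path plus the low-rank path gives the same output as a single matrix W + scale · B · A. The `alpha / rank` scaling follows the usual LoRA convention and is not stated in the article.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(M):
    return [list(col) for col in zip(*M)]


def add_scaled(X, Y, scale):
    """Return X + scale * Y element-wise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4   # hypothetical toy sizes
scale = alpha / rank

W = random_matrix(d_out, d_in, rng)     # frozen pretrained weight
A = random_matrix(rank, d_in, rng)      # trainable down-projection
B = random_matrix(d_out, rank, rng)     # trainable up-projection (random here to mimic a trained adapter)
x = random_matrix(3, d_in, rng)         # batch of 3 inputs

# Training-time forward pass: frozen path plus low-rank path.
low_rank = matmul(matmul(x, transpose(A)), transpose(B))
adapter_out = add_scaled(matmul(x, transpose(W)), low_rank, scale)

# Inference: fold the adapter into the weight once, then use a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(x, transpose(W_merged))

max_diff = max(abs(a - m) for ra, rm in zip(adapter_out, merged_out)
               for a, m in zip(ra, rm))
trainable_fraction = (rank * d_in + d_out * rank) / (d_out * d_in)
```

In this toy case the adapter holds half as many values as W. The fraction shrinks quickly at real model sizes, because the adapter grows linearly with the hidden dimension while W grows quadratically.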

End‑to‑End Training Workflow

Collect and rigorously clean data (as described above).

Balance data ratio between open‑source and domain data.

Pre‑training or continued pre‑training on text‑only segments.

Supervised fine‑tuning (SFT) in QA format.

Reward‑model training on ranked QA outputs.

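Preference optimization with DPO (Direct Preference Optimization) or PPO (Proximal Policy Optimization): DPO trains directly on chosen/rejected pairs, while PPO optimizes the policy against the reward model.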

Evaluate across multiple dimensions (knowledge breadth, reasoning, generation quality, ethics, etc.).
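For the DPO step in the workflow above, the per-pair objective is small enough to write out. The sketch below computes the standard DPO loss from summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function name, argument names, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the chosen response and lowers the rejected one: the loss drops.
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is why DPO needs only ranked pairs and no separate reward model.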

Data Formats for Each Stage

Pre‑training: plain‑text segments.

# txt format
Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.

SFT: JSON objects with instruction, optional input, output, and optional history.

{
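  "instruction": "Who are you?",
  "input": "",
  "output": "I am a generative natural-language-processing model trained in-house by XXX. My name is GPT, and I specialize in answering operations and maintenance questions.",
  "history": [["Hello", "Hello! How can I help you?"]]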
}

Reward Model: JSON with a ranked output list.

{
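  "instruction": "I need to host a dinner for six. Can you recommend three dishes that contain no nuts or seafood?",
  "input": "",
  "output": [
    "Sure, here are three dishes that contain no nuts or seafood...",
    "Dried tofu with hot peppers, shredded potatoes, braised pork",
    "For six people, you should order eight dishes..."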
  ],
  "history": []
}

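DPO/PPO: pairs of chosen and rejected responses. In the sample below, the first output is the preferred response and the second is the rejected one.

{
  "instruction": "Explain why the following fraction is equal to 1/4\n4/16",
  "input": "",
  "output": [
    "The fraction 4/16 is equal to 1/4 because both the numerator and the denominator are divisible by 4.",
    "1/4 is the same as 1/4."
  ]
}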
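Records with a ranked output list can be expanded into explicit chosen/rejected pairs before training. The sketch below assumes the list is ordered from best to worst, as in the samples above. The helper name `to_preference_pairs` and the `prompt`/`chosen`/`rejected` keys are hypothetical and not tied to any particular training framework.

```python
import json
from itertools import combinations


def to_preference_pairs(record: dict) -> list:
    """Expand a best-to-worst ranked output list into (chosen, rejected) pairs."""
    outputs = record["output"]
    if not isinstance(outputs, list) or len(outputs) < 2:
        raise ValueError("need at least two ranked outputs")
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    # combinations() preserves list order, so the first item of each pair ranks higher.
    return [{"prompt": prompt, "chosen": better, "rejected": worse}
            for better, worse in combinations(outputs, 2)]


# A raw string keeps the JSON escape \n intact for json.loads.
raw = r'''{
  "instruction": "Explain why the following fraction is equal to 1/4\n4/16",
  "input": "",
  "output": ["Both parts are divisible by 4.", "1/4 is the same as 1/4."]
}'''
pairs = to_preference_pairs(json.loads(raw))

ranked = {"instruction": "Recommend three dishes.", "input": "",
          "output": ["best", "middle", "worst"]}
ranked_pairs = to_preference_pairs(ranked)
```

A list of three ranked outputs yields three pairs, which is how a single reward-model record can feed several pairwise comparisons.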

Evaluation Checklist

Knowledge breadth – answer questions across diverse topics.

Understanding – handle deep‑reading comprehension.

Generation quality – produce coherent, logical text.

Long‑text handling – process and generate extended passages.

Ethics and morality – respond appropriately to moral dilemmas.

Dialogue – maintain natural conversational flow.

Logical reasoning – solve inference problems.

Problem‑solving – answer math or coding challenges.

Sentiment analysis – correctly identify emotions in text.

Model Selection & Fine‑Tuning Strategy

Open models usually ship in two variants: base (the plain pre‑trained LLM) and chat (base plus instruction fine‑tuning). Guidelines:

If the domain gap is large, start from the base version, inject knowledge, then perform SFT.

When resources are limited, the chat version provides built‑in dialogue ability.

LoRA pairs best with the chat version: the adapter is small, so it benefits from starting with a model that already follows instructions.

Data size < 10 k → prefer chat model; data size > 100 k → prefer base model.

Practical Tips

Use high‑quality domain documents (technical manuals, standards) for continued pre‑training.

Mix a substantial amount of general data (≥85 %) to avoid catastrophic forgetting.

Consider Multi‑Task Instruction Pre‑Training (MIP) by adding SFT data during continued pre‑training.

Keep SFT datasets balanced across representative tasks; avoid overly large per‑task sample counts to prevent over‑fitting.

Set the initial learning rate to ≤ 2e‑5 for full‑parameter fine‑tuning; LoRA can use ≈ 5e‑5. When the batch size grows by a factor of k, scale the learning rate by roughly √k.

GPU Memory Estimation (Mixed‑Precision)

For a model with φ parameters, the parameters, gradients, and optimizer states together take ≈ 20 × φ bytes during mixed‑precision training with AdamW. Example: a 7 B‑parameter model needs roughly 7 × 10⁹ × 20 bytes ≈ 140 GB for full‑parameter training, before counting activations, which grow with batch size and sequence length.

References

垂域大模型训练 (vertical‑domain large‑model training) – https://arxiv.org/pdf/2307.15290.pdf

Instruction Tagging for Analyzing Supervised Fine‑Tuning of Large Language Models – https://arxiv.org/pdf/2308.07074.pdf

大语言模型理论与实践 (Large Language Models: Theory and Practice) – https://intro-llm.github.io/

