Unlocking Domain-Specific Large Model Training: Proven Tricks and Pitfalls
This article shares practical techniques for continued pre-training of domain-specific large models: data selection, mixing ratios with general data, multi-task instruction pre-training, resource-aware fine-tuning strategies, evaluation set design, vocabulary considerations, and deployment constraints for 7B–13B models.
Continued Pre-Training Strategies for Domain-Specific Large Models
Domain‑specific high‑quality data such as technical standards, books, papers, and relevant websites provide dense knowledge that improves model adaptation to specialized tasks.
Mixing general data mitigates catastrophic forgetting of broad abilities. Empirical ratios:
BloombergGPT (from‑scratch) used a 1:1 domain‑to‑general mix.
ChatHome (continued pre-training) found 1:5 (domain:general) optimal; when domain data is scarce, 1:5–1:10 is recommended.
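As a rough illustration of the ratios above, the following sketch computes how many general-corpus tokens to sample for a given amount of domain data. The function names and the token counts in the usage line are hypothetical; only the 1:5 default comes from the text.

```python
def general_tokens_needed(domain_tokens: int, domain_parts: int = 1, general_parts: int = 5) -> int:
    """Return the number of general tokens for a domain:general ratio."""
    if domain_tokens < 0 or domain_parts <= 0 or general_parts < 0:
        raise ValueError("token counts and ratio parts must be non-negative, domain_parts > 0")
    return domain_tokens * general_parts // domain_parts


def mix_summary(domain_tokens: int, domain_parts: int = 1, general_parts: int = 5) -> dict:
    """Summarize the resulting training mix for a given ratio."""
    general = general_tokens_needed(domain_tokens, domain_parts, general_parts)
    total = domain_tokens + general
    return {
        "domain_tokens": domain_tokens,
        "general_tokens": general,
        "total_tokens": total,
        # Share of the final corpus that is domain data.
        "domain_fraction": domain_tokens / total if total else 0.0,
    }


# Hypothetical corpus of 2B domain tokens mixed at 1:5.
summary = mix_summary(2_000_000_000)
```

At 1:5 the domain data makes up about one sixth of the training stream, which is worth keeping in mind when budgeting compute.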
Multi-Task Instruction Pre-Training (MIP): add downstream SFT datasets to the continued pre-training corpus, an approach used by models such as T5, ExT5, and GLM-130B. This injects instruction-following behavior early and is reported to give large gains on domain evaluation sets.
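A minimal sketch of the idea, assuming SFT samples are flattened to plain text and shuffled into the pre-training stream. The `Instruction:/Response:` template, the `sft_fraction` value, and both function names are illustrative assumptions, not something the source specifies.

```python
import random


def render_instruction(sample: dict) -> str:
    """Flatten an instruction/response pair into one plain-text document."""
    return f"Instruction: {sample['instruction']}\nResponse: {sample['response']}"


def build_pretraining_stream(domain_docs, sft_samples, sft_fraction=0.1, seed=0):
    """Mix raw documents with rendered SFT samples.

    sft_fraction is the approximate share of SFT documents in the final
    stream (an assumed knob; tune it on your own evaluation set).
    """
    if not 0 <= sft_fraction < 1:
        raise ValueError("sft_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_sft / (n_docs + n_sft) ~= sft_fraction for n_sft.
    n_sft = round(len(domain_docs) * sft_fraction / (1 - sft_fraction))
    n_sft = min(n_sft, len(sft_samples))
    chosen = rng.sample(list(sft_samples), n_sft)
    stream = list(domain_docs) + [render_instruction(s) for s in chosen]
    rng.shuffle(stream)
    return stream


docs = [f"domain document {i}" for i in range(9)]
sft = [{"instruction": f"question {i}", "response": f"answer {i}"} for i in range(5)]
stream = build_pretraining_stream(docs, sft, sft_fraction=0.1)
```

Counting documents rather than tokens keeps the sketch short; a real pipeline would more likely balance by token count.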
Resource-aware SFT: choose the base model according to data volume and GPU capacity.
~5k samples → fine‑tune a chat‑style model (preserves dialogue format).
~100k samples → fine-tune a base model, which has more room to adapt when data is plentiful.
When fine-tuning a chat model, keep its original system prompt and input format, and avoid full-parameter training to limit forgetting of its existing capabilities.
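The rule of thumb above can be written down as a small helper. This is only a sketch: the source gives two reference points (about 5k and about 100k samples) and says nothing about the range in between, so the helper deliberately returns "undecided" there instead of inventing a cutoff. The function name is hypothetical.

```python
def choose_sft_start(num_samples: int, small: int = 5_000, large: int = 100_000) -> str:
    """Suggest a starting checkpoint for SFT based on dataset size.

    Returns "chat", "base", or "undecided" (the source gives no guidance
    between the two reference points, so compare both on a held-out set).
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if num_samples <= small:
        return "chat"   # little data: keep the existing dialogue behavior
    if num_samples >= large:
        return "base"   # plenty of data: start from the base model
    return "undecided"


for n in (3_000, 20_000, 200_000):
    print(n, choose_sft_start(n))
```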
Evaluation set design: create two complementary benchmarks:
Automatic multiple‑choice test for rapid screening.
Manual open‑ended test for thorough, real‑world assessment.
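The automatic multiple-choice test can be scored with very little code. The sketch below assumes the model is prompted to reply with a single option letter and takes the first standalone uppercase letter as its answer; that parsing rule is a naive assumption and will misread outputs such as "A good answer is B".

```python
import re


def extract_choice(model_output: str, choices: str = "ABCD"):
    """Return the first standalone choice letter in the output, or None."""
    match = re.search(rf"\b([{choices}])\b", model_output)
    return match.group(1) if match else None


def accuracy(model_outputs, gold_answers) -> float:
    """Fraction of questions where the extracted letter matches the gold one."""
    if len(model_outputs) != len(gold_answers):
        raise ValueError("outputs and gold answers must have the same length")
    if not gold_answers:
        return 0.0
    correct = sum(
        extract_choice(out) == gold for out, gold in zip(model_outputs, gold_answers)
    )
    return correct / len(gold_answers)


outputs = ["B", "Answer: C. Because the standard requires it.", "no idea"]
gold = ["B", "C", "A"]
score = accuracy(outputs, gold)
```

Unparseable outputs count as wrong here, which keeps the metric conservative during rapid screening.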
Vocabulary expansion generally yields limited performance improvement. Its main benefit is faster decoding, because the same text is encoded in fewer tokens, rather than higher accuracy.
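A toy example of why a larger vocabulary speeds up decoding: each generated token costs one forward pass, so fewer tokens for the same text means fewer steps. The greedy longest-match tokenizer and both vocabularies below are purely illustrative; production tokenizers (BPE, SentencePiece) work differently.

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match tokenization with single-character fallback."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


base_vocab = {"pre", "train", "ing"}
expanded_vocab = base_vocab | {"pretraining"}  # hypothetical domain term added

base_tokens = greedy_tokenize("pretraining", base_vocab)
expanded_tokens = greedy_tokenize("pretraining", expanded_vocab)
print(len(base_tokens), len(expanded_tokens))
```

The text is identical in both cases; only the number of decoding steps changes, which matches the observation that accuracy gains from vocabulary expansion tend to be small.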
Open-source base models: a growing ecosystem of 7B–13B models (e.g., ChatGLM, Baichuan, Qwen, LLaMA) enables incremental pre-training and fine-tuning without building a new foundation model.
Practical Deployment Considerations
Task-focused deployment often means replacing a rule-based component (e.g., in a Text-to-SQL pipeline) with a model fine-tuned for that task. The source reports that this can raise accuracy above 90% and outperform generic APIs.
Scenario outweighs raw model size: end-to-end solutions (e.g., GPT-4, AutoGPT) set high user expectations, so selecting the right use case and packaging the model appropriately is crucial.
Hardware constraints: most enterprises target on-premise models of around 10B–13B parameters. Even with inference libraries such as llama.cpp, models larger than 100B remain impractical for typical on-premise deployments.
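A back-of-the-envelope estimate makes the hardware constraint concrete. The sketch below counts weight memory only (parameters times bytes per parameter, with 1 GB taken as 10^9 bytes) and ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    if params_billion < 0 or bits_per_param <= 0:
        raise ValueError("params_billion must be >= 0 and bits_per_param > 0")
    # (params_billion * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_param / 8


for size, bits in [(13, 16), (13, 4), (100, 16), (100, 4)]:
    print(f"{size}B at {bits}-bit: {weight_memory_gb(size, bits):.1f} GB")
```

Under these assumptions a 13B model needs about 26 GB in 16-bit precision and about 6.5 GB at 4 bits, while a 100B model needs about 200 GB in 16-bit and still about 50 GB at 4 bits.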
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
