Guidelines for Building Domain-Specific Large Models: Dataset Construction, Training Methods, Evaluation, and Hardware Benchmarking
This article presents a comprehensive guide on constructing domain-specific large language models, covering the differences from general models, how to build high‑quality domain datasets, selecting appropriate training methods, designing validation sets, evaluating model capabilities, and benchmarking domestic hardware performance.
01
Domain Large Model vs. General Large Model
General large models are trained on broad, generic datasets and often lack deep domain knowledge, which can cause issues when applied to specific enterprise scenarios such as building a smart Q&A bot for Dipu Technology.
Two main solutions exist: Retrieval‑Augmented Generation (RAG) using external knowledge bases, or fine‑tuning a general model with domain data to create a domain‑specific model.
Article Outline
Differences between domain and general large models
Construction of domain datasets from enterprise data
Selection of model training methods
Construction of validation sets and model evaluation methods
Domestic hardware benchmarking
Q&A session
02
Differences Between Domain and General Large Models
1. Dataset Differences
General models use many open‑source datasets with wide coverage but limited depth in specific industries.
Domain models can incorporate general data but must supplement with industry‑specific data, which is often scarce.
2. Flexibility vs. Accuracy
General models are highly flexible via prompts but may lack accuracy in specialized domains; domain models trade flexibility for higher accuracy.
3. Complexity
General models are more complex; domain models can achieve required performance with smaller parameter counts.
03
Building Domain Datasets from Enterprise Data
Key challenges:
Few high‑quality domain datasets available.
Data preprocessing is costly and may conflict with privacy constraints.
Balancing data diversity affects model flexibility and accuracy.
(1) SFT (Supervised Fine‑Tuning)
When high‑quality data is scarce, generate Q&A pairs from documents (e.g., using ChatGPT or manual effort) and fine‑tune on ~1,000 pairs to achieve acceptable performance.
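The Q&A pairs can then be serialized into a standard instruction-tuning file. A minimal sketch, assuming a common SFT record layout (instruction / input / output); the example pairs and the output filename are illustrative, not from the talk:

```python
import json

# Hypothetical Q&A pairs extracted from enterprise documents
# (in practice generated with ChatGPT or written manually).
qa_pairs = [
    {"question": "What products does the company offer?",
     "answer": "An intelligent Q&A bot and data-analysis tools."},
    {"question": "Which GPU is needed to fully fine-tune a 13B model?",
     "answer": "An A800-class GPU."},
]

def to_sft_records(pairs):
    """Convert Q&A pairs into instruction-tuning records
    (a common SFT format: instruction / input / output)."""
    return [
        {"instruction": p["question"], "input": "", "output": p["answer"]}
        for p in pairs
    ]

def write_jsonl(records, path):
    # One JSON object per line, keeping non-ASCII text readable.
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

records = to_sft_records(qa_pairs)
write_jsonl(records, "sft_train.jsonl")
```

Roughly 1,000 such records are the target scale suggested above.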
(2) Leveraging the Model’s Own Capabilities
Use large models (Claude‑2, GPT‑4, GPT‑3.5, LLaMA‑2 13B, ChatGLM‑2 6B) to extract structured knowledge from raw documents via prompts, reducing manual effort while preserving privacy.
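One way to wire this up is an extraction prompt plus an injected model call. A sketch under assumptions: `call_llm` is a hypothetical stand-in for whichever API or on-prem model you deploy, and the prompt wording is illustrative:

```python
import json

# Template asking the model to emit structured Q&A pairs as JSON.
# Double braces produce literal braces in the formatted prompt.
EXTRACTION_PROMPT = """You are a domain knowledge extractor.
From the document below, extract question-answer pairs as JSON:
[{{"question": "...", "answer": "..."}}]

Document:
{document}
"""

def build_extraction_prompt(document: str) -> str:
    return EXTRACTION_PROMPT.format(document=document)

def extract_qa(document, call_llm):
    # call_llm(prompt) -> str is injected so a private, locally
    # deployed model can be used, keeping sensitive data in-house.
    return json.loads(call_llm(build_extraction_prompt(document)))
```

Injecting `call_llm` rather than hard-coding a vendor SDK is what lets the same pipeline run against GPT-4 during prototyping and a local LLaMA-2 or ChatGLM-2 in production.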
(3) Balancing Data Diversity
Keep roughly 30% domain data in the training set; higher ratios improve domain accuracy but reduce general flexibility.
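Mixing the two corpora to hit a target ratio can be scripted. A minimal sketch using only the standard library; the 30% default is the talk's starting point, and the sample data is illustrative:

```python
import random

def mix_datasets(domain, general, domain_ratio=0.3, seed=0):
    """Build a training set in which domain examples make up
    roughly `domain_ratio` of the total."""
    rng = random.Random(seed)
    n_domain = len(domain)
    # Number of general examples so domain data is ~domain_ratio of the mix.
    n_general = round(n_domain * (1 - domain_ratio) / domain_ratio)
    n_general = min(n_general, len(general))
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed

domain = [f"d{i}" for i in range(30)]
general = [f"g{i}" for i in range(200)]
train = mix_datasets(domain, general)  # 30 domain + 70 general examples
```

Raising `domain_ratio` pushes the model toward accuracy; lowering it preserves flexibility.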
(4) Summary
Use SFT to lower data quality requirements.
Exploit existing LLMs for knowledge extraction.
Combine general and domain data via a flexible knowledge base.
04
Model Training Method Selection
Common methods: Full‑parameter fine‑tuning, P‑tuning, LoRA, Q‑LoRA.
1. Comparison of Training Methods
Full‑parameter fine‑tuning: high hardware cost (e.g., A800‑class GPUs for a 13B model), highest accuracy, low flexibility.
Efficient fine‑tuning (P‑tuning, LoRA, Q‑LoRA): low hardware requirements (can run on consumer GPUs or even an Apple M1 Pro), more flexible, but accuracy varies by scenario.
2. Choosing the Method
If the dataset only adjusts output format, use efficient fine‑tuning (LoRA/Q‑LoRA).
If the model must memorize new knowledge, prefer full‑parameter fine‑tuning.
Consider hardware availability: full‑parameter fine‑tuning of a 13B model needs A800‑class GPUs, while a 7B model can be fine‑tuned on an RTX 4090.
Consider desired outcome: prioritize accuracy → full‑parameter; prioritize flexibility → efficient methods.
3. Practical Recommendation
Start with Q‑LoRA; if unsuitable, fall back to LoRA; if still insufficient, use full‑parameter fine‑tuning.
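The reason LoRA and Q-LoRA are so cheap is that the pretrained weight stays frozen and only a low-rank update is trained. A minimal numpy sketch of that idea (not the talk's actual training code; dimensions and init scales are illustrative):

```python
import numpy as np

# LoRA: freeze W, learn a low-rank update B @ A with rank r << d,
# so only r * (d_in + d_out) parameters train instead of d_in * d_out.
# Q-LoRA is the same idea with the frozen W stored quantized (4-bit).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    # Base path plus scaled low-rank path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B = 0 (before training), output equals the frozen base model's.
assert np.allclose(lora_forward(x), W @ x)

# After training, the update merges into W for zero inference overhead:
W_merged = W + (alpha / r) * (B @ A)
```

Here the adapter trains 8 × (64 + 64) = 1,024 parameters versus 4,096 for the full matrix, which is why these methods fit on consumer GPUs.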
05
Validation Set Construction & Model Evaluation
After a costly training run, verify that the model is actually useful: that it has absorbed the new domain knowledge without losing its general abilities.
1. Challenges
Domain models lack universal benchmark datasets, requiring custom evaluation.
2. Five‑Dimensional Capability Assessment
Tokenization
Syntactic analysis
Semantic similarity
Disambiguation
Overall comprehension
3. Example Evaluation Tasks
Tokenization: ask model to output token list.
Syntactic analysis: extract subject, predicate, object.
Semantic similarity: compare meaning of two sentences.
Disambiguation: identify entity meaning in context.
Understanding: extract key information from long text.
4. Validation Set Preparation Methodology
Prepare a generic validation dataset.
Prepare domain‑specific datasets covering the five dimensions.
Use a baseline open‑source model (e.g., LLaMA‑2, ChatGLM) for comparative radar‑chart analysis.
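Per-dimension accuracy is what feeds the radar chart. A small sketch of the aggregation step, using only the standard library; the model names and toy results are illustrative:

```python
def radar_scores(results):
    """Average per-dimension accuracy for radar-chart comparison.
    `results` maps model name -> list of (dimension, correct) records."""
    scores = {}
    for model, records in results.items():
        per_dim = {}
        for dim, correct in records:
            per_dim.setdefault(dim, []).append(1.0 if correct else 0.0)
        scores[model] = {d: sum(v) / len(v) for d, v in per_dim.items()}
    return scores

# Toy evaluation results for a domain model vs. an open-source baseline.
results = {
    "domain-model": [("tokenization", True), ("tokenization", True),
                     ("understanding", True), ("understanding", False)],
    "llama-2-13b":  [("tokenization", True), ("tokenization", False),
                     ("understanding", False), ("understanding", False)],
}
scores = radar_scores(results)
```

The resulting per-model score dictionaries plot directly as radar-chart axes, one axis per capability dimension.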
06
Domestic Hardware Benchmarking
Inference speed (tokens/s) for 13B and 7B models:
NVIDIA A800: 13B ≈ 33 t/s, 7B ≈ 45 t/s.
Moore Threads S3000: 13B ≈ 20 t/s.
Moore Threads S4000: 13B ≈ 29 t/s.
Huawei Ascend 910A: 13B ≈ 15 t/s, 7B ≈ 23 t/s.
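Throughput figures like these can be reproduced with a simple timing harness. A sketch under assumptions: `generate` is a hypothetical callable returning generated token ids, to be replaced with your actual inference stack on each card:

```python
import time

def measure_tokens_per_second(generate, prompt, n_runs=3):
    """Rough tokens/s measurement averaged over several runs.
    `generate(prompt)` should return the list of generated tokens."""
    generate(prompt)  # warm-up run so caching/compilation doesn't skew timing
    total_tokens, total_time = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_time

# Usage with a stand-in generator (replace with a real model call):
def _fake_generate(prompt):
    time.sleep(0.005)       # simulate inference latency
    return list(range(50))  # pretend 50 tokens were produced

tps = measure_tokens_per_second(_fake_generate, "hello")
```

Fixing the prompt, batch size, and output length across cards is what makes the numbers comparable.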
All of the above cards can run inference once checkpoints trained on NVIDIA hardware are converted to each vendor's format.
07
Q&A Session
Q1: Can model validation be automated?
A1: Yes, many checks (e.g., tokenization) can be scripted; proper prompts enable structured outputs for automated comparison.
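The tokenization check mentioned in A1 can be sketched as a scripted comparison: prompt the model to answer with a JSON token list, then compare against a gold segmentation. The example sentence and expected tokens are illustrative:

```python
import json

def check_tokenization(model_output: str, expected_tokens):
    """Automated check: the model is prompted to reply with a JSON
    list of tokens; compare that list against a gold segmentation."""
    try:
        tokens = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a failure
    return tokens == expected_tokens

# Example: a model asked to segment "大模型微调" (large-model fine-tuning).
assert check_tokenization('["大模型", "微调"]', ["大模型", "微调"])
assert not check_tokenization("大模型 / 微调", ["大模型", "微调"])
```

The same pattern (structured JSON output plus an exact or scored comparison) extends to the syntactic-analysis and extraction tasks above.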
Q2: How does a model fine‑tuned on 1‑2k domain examples perform on precise queries?
A2: It can answer accurately if the query is covered by the training data; otherwise it may hallucinate.
Q3: Will slight re‑phrasings break the model?
A3: Small changes can cause errors, especially after full‑parameter fine‑tuning; additional data and validation are needed to mitigate.
Q4: How to iterate when validation shows low accuracy?
A4: Diagnose which capability is weak (e.g., tokenization vs. understanding) and either increase data volume/quality or switch training methods accordingly.
Thank you for reading.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.