
Guidelines for Building Domain-Specific Large Models: Dataset Construction, Training Methods, Evaluation, and Hardware Benchmarking

This article presents a comprehensive guide on constructing domain-specific large language models, covering the differences from general models, how to build high‑quality domain datasets, selecting appropriate training methods, designing validation sets, evaluating model capabilities, and benchmarking domestic hardware performance.

DataFunTalk

01

Domain Large Model vs. General Large Model

General large models are trained on broad, generic datasets and often lack deep domain knowledge, which can cause issues when applied to specific enterprise scenarios such as building a smart Q&A bot for Dipu Technology.

Two main solutions exist: Retrieval‑Augmented Generation (RAG) using external knowledge bases, or fine‑tuning a general model with domain data to create a domain‑specific model.

Article Outline

Differences between domain and general large models

Construction of domain datasets from enterprise data

Selection of model training methods

Construction of validation sets and model evaluation methods

Domestic hardware benchmarking

Q&A session

02

Key Differences Between Domain and General Models

1. Dataset Differences

General models use many open‑source datasets with wide coverage but limited depth in specific industries.

Domain models can incorporate general data but must supplement with industry‑specific data, which is often scarce.

2. Flexibility vs. Accuracy

General models are highly flexible via prompts but may lack accuracy in specialized domains; domain models trade flexibility for higher accuracy.

3. Complexity

General models are more complex; domain models can achieve required performance with smaller parameter counts.

03

Building Domain Datasets from Enterprise Data

Key challenges:

Few high‑quality domain datasets available.

Data preprocessing is costly and may conflict with privacy constraints.

Balancing data diversity affects model flexibility and accuracy.

(1) SFT (Supervised Fine‑Tuning)

When high‑quality data is scarce, generate Q&A pairs from documents (e.g., using ChatGPT or manual effort) and fine‑tune on ~1,000 pairs to achieve acceptable performance.
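As a sketch of this step, the snippet below turns document-derived Q&A pairs into SFT training records. The field names follow the common Alpaca-style instruction format and are an illustrative assumption, not the exact schema used in the talk:

```python
import json

def build_sft_records(qa_pairs, system_prompt="You are a domain assistant."):
    """Turn (question, answer) pairs into Alpaca-style SFT records."""
    records = []
    for question, answer in qa_pairs:
        records.append({
            "instruction": question,
            "input": "",
            "output": answer,
            "system": system_prompt,
        })
    return records

# Example: one pair extracted (manually or via ChatGPT) from an internal document
pairs = [("What is the product warranty period?",
          "The standard warranty period is 24 months from delivery.")]
records = build_sft_records(pairs)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Writing such records out as JSONL gives a dataset most fine-tuning frameworks can consume directly.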

(2) Leveraging the Model’s Own Capabilities

Use large models (Claude‑2, GPT‑4, GPT‑3.5, LLaMA‑2 13B, ChatGLM‑2 6B) to extract structured knowledge from raw documents via prompts, reducing manual effort while preserving privacy.
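A minimal sketch of such an extraction prompt (the template wording is an assumption, not the prompt used in the talk):

```python
# Hypothetical extraction prompt; adjust wording per model (GPT-4, LLaMA-2, etc.)
EXTRACTION_PROMPT = """Read the document below and extract question-answer pairs \
that a domain expert could answer from it alone.
Return a JSON list of objects with "question" and "answer" fields.

Document:
{document}"""

def make_extraction_prompt(document: str) -> str:
    """Build the prompt sent to the extraction model."""
    return EXTRACTION_PROMPT.format(document=document.strip())
```

Running a locally hosted model (e.g., LLaMA-2 13B or ChatGLM-2 6B) over internal documents with such a prompt keeps sensitive data on-premises.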

(3) Balancing Data Diversity

Keep roughly 30% domain data in the training mix; higher ratios improve domain accuracy but reduce general flexibility.
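The mixing step can be sketched as a simple sampler. The 30% figure comes from the article; the function itself is illustrative:

```python
import random

def mix_training_data(domain, general, domain_ratio=0.3, seed=0):
    """Sample general examples so domain data makes up ~domain_ratio of the mix."""
    rng = random.Random(seed)
    n_general = round(len(domain) * (1 - domain_ratio) / domain_ratio)
    sampled = rng.sample(general, min(n_general, len(general)))
    mixed = list(domain) + sampled
    rng.shuffle(mixed)
    return mixed
```

With 30 domain examples and a large general pool, this yields a 100-example set in which the domain share is 30%.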

(4) Summary

Use SFT to lower data quality requirements.

Exploit existing LLMs for knowledge extraction.

Combine general and domain data via a flexible knowledge base.

04

Model Training Method Selection

Common methods: Full‑parameter fine‑tuning, P‑tuning, LoRA, Q‑LoRA.

1. Comparison of Training Methods

Full‑parameter fine‑tuning: high hardware cost (e.g., an A800 for 13B models), high accuracy, low flexibility.

Efficient fine‑tuning (LoRA, Q‑LoRA): low hardware requirements (can run on consumer GPUs or even an Apple M1 Pro), more flexible, but accuracy may vary by scenario.

2. Choosing the Method

If the dataset only adjusts output format, use efficient fine‑tuning (LoRA/Q‑LoRA).

If the model must memorize new knowledge, prefer full‑parameter fine‑tuning.

Consider hardware availability: 13B models need A800; 7B models can run on RTX 4090.

Consider desired outcome: prioritize accuracy → full‑parameter; prioritize flexibility → efficient methods.

3. Practical Recommendation

Start with Q‑LoRA; if unsuitable, fall back to LoRA; if still insufficient, use full‑parameter fine‑tuning.
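With the Hugging Face peft and bitsandbytes stack, a typical Q-LoRA starting point looks like the config sketch below. The hyperparameter values are illustrative assumptions, not the talk's settings, and the target modules vary by model architecture; dropping the quantization config turns the same setup into plain LoRA:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base weights (the "Q" in Q-LoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapter config; r and lora_alpha are common defaults, not tuned values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections for LLaMA-style models
    task_type="CAUSAL_LM",
)
```

Pass `bnb_config` to `AutoModelForCausalLM.from_pretrained` as `quantization_config`, then wrap the model with `peft.get_peft_model(model, lora_config)` before training.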

05

Validation Set Construction & Model Evaluation

Training is expensive, so afterwards you must verify that the model is genuinely useful and has retained the knowledge it was trained on.

1. Challenges

Domain models lack universal benchmark datasets, requiring custom evaluation.

2. Five‑Dimensional Capability Assessment

Tokenization ability

Syntactic & grammatical analysis

Semantic disambiguation

Understanding

Overall comprehension

3. Example Evaluation Tasks

Tokenization: ask model to output token list.

Syntactic analysis: extract subject, predicate, object.

Semantic similarity: compare meaning of two sentences.

Disambiguation: identify entity meaning in context.

Understanding: extract key information from long text.
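The five task types above can be captured as prompt templates. The wording below is a hypothetical example, not the exact prompts from the talk:

```python
# Hypothetical prompt templates for the five capability dimensions.
EVAL_PROMPTS = {
    "tokenization": "Split this sentence into tokens and output them as a JSON list: {text}",
    "syntax": "Extract the subject, predicate, and object of this sentence as JSON: {text}",
    "similarity": "Answer 'same' or 'different': do these sentences mean the same?\nA: {a}\nB: {b}",
    "disambiguation": "In the sentence below, which meaning does '{entity}' carry? {text}",
    "understanding": "List the key facts stated in the following passage:\n{text}",
}

def render_task(task: str, **fields) -> str:
    """Fill a template to produce one evaluation query."""
    return EVAL_PROMPTS[task].format(**fields)
```

Each validation example is then a rendered query plus a reference answer, which keeps the evaluation reproducible across model versions.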

4. Validation Set Preparation Methodology

Prepare a generic validation dataset.

Prepare domain‑specific datasets covering the five dimensions.

Use a baseline open‑source model (e.g., LLaMA‑2, ChatGLM) for comparative radar‑chart analysis.
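Per-dimension pass rates for the radar chart can be aggregated as below (a minimal sketch; `results` pairs a dimension name with a pass/fail flag, one entry per evaluated example):

```python
from collections import defaultdict

def dimension_scores(results):
    """Average pass rate per capability dimension (the radar-chart axes)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for dimension, passed in results:
        totals[dimension] += float(passed)
        counts[dimension] += 1
    return {d: totals[d] / counts[d] for d in totals}
```

Computing these scores for both the fine-tuned model and the open-source baseline gives the two polygons to overlay on the radar chart.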

06

Domestic Hardware Benchmarking

Inference speed (tokens/s) for 13B and 7B models:

NVIDIA A800: 13B ≈ 33 tokens/s, 7B ≈ 45 tokens/s.

Moore Threads S3000: 13B ≈ 20 tokens/s.

Moore Threads S4000: 13B ≈ 29 tokens/s.

Huawei Ascend 910A: 13B ≈ 15 tokens/s, 7B ≈ 23 tokens/s.

All of the cards above can run models trained on NVIDIA hardware after a format conversion step.
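Throughput figures like those above can be reproduced with a simple timing harness (a sketch; `generate` stands in for whatever inference call your serving stack exposes, and it is assumed to return the generated token list):

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Average tokens/s over n_runs; generate(prompt) must return a token list."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)
```

In practice, discard a warm-up run and use identical prompts and generation lengths across cards so the numbers are comparable.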

07

Q&A Session

Q1: Can model validation be automated?

A1: Yes, many checks (e.g., tokenization) can be scripted; proper prompts enable structured outputs for automated comparison.
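For example, a tokenization check can be scored automatically when the model is prompted to answer in JSON (a sketch; the exact output format is an assumption):

```python
import json

def score_tokenization(model_output: str, reference: list) -> bool:
    """True if the model's JSON token list matches the reference exactly."""
    try:
        tokens = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return tokens == reference
```

Malformed output simply scores as a failure, so the whole validation set can be run unattended.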

Q2: How does a model fine‑tuned on 1‑2k domain examples perform on precise queries?

A2: It can answer accurately if the query is covered by the training data; otherwise it may hallucinate.

Q3: Will slight re‑phrasings break the model?

A3: Small changes can cause errors, especially after full‑parameter fine‑tuning; additional data and validation are needed to mitigate.

Q4: How to iterate when validation shows low accuracy?

A4: Diagnose which capability is weak (e.g., tokenization vs. understanding) and either increase data volume/quality or switch training methods accordingly.

Thank you for reading.

Tags: AI, fine-tuning, large language model, model evaluation, dataset construction, domain model, hardware benchmarking
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
