Building High‑Performance Vertical Domain LLMs: From Continued Pre‑Training to Retrieval‑Augmented Generation
This article systematically explains how to create vertical domain large language models by continuing pre‑training on domain data, constructing fine‑tuning datasets with self‑instruct, reducing hallucinations, and integrating knowledge retrieval, while also reviewing related papers, products, and system architectures.
Background
General‑purpose large language models (LLMs) can perform many tasks but often do so sub‑optimally. In contrast, domain‑specific (vertical) LLMs focus on a narrow set of tasks, achieve higher accuracy, and are more trustworthy for productivity‑critical applications such as generating correct SQL statements.
Typical Vertical‑LLM Development Pipeline
Continued Pre‑Training: Train a base model further on domain‑specific corpora to inject specialized terminology and knowledge.
Supervised Fine‑Tuning (SFT): Align the model with domain tasks and desired answer styles.
Reinforcement Learning from Human Feedback (RLHF): Refine responses to match professional tone and user preferences.
Vertical models usually employ Retrieval‑Augmented Generation (RAG): they first retrieve relevant knowledge and then generate answers, which reduces hallucinations, improves timeliness, and enables rapid intervention.
Continued Pre‑Training
Continuing pre‑training on domain data allows the model to learn specialized vocabularies. For example, the scientific LLM Mozi was further trained on a 4 B‑token scientific corpus, reducing perplexity from 6.95 to 3.46 and improving downstream task scores from 0.38 to 0.52.
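As a reminder of what the perplexity figures measure, here is a minimal sketch of the usual definition: the exponential of the negative mean log-probability per token. The probabilities below are made-up values for illustration, not numbers from Mozi.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities for the same text under two models.
base_model = [math.log(p) for p in (0.10, 0.20, 0.15)]
domain_model = [math.log(p) for p in (0.30, 0.35, 0.25)]

# The model that assigns higher probability to domain text has lower perplexity.
print(perplexity(base_model) > perplexity(domain_model))
```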
Mixed‑Domain Data
To avoid catastrophic forgetting, mix generic text with domain data. The financial model XuanYuan (based on BLOOM) used a hybrid‑tuning strategy that combined both data types during pre‑training and instruction tuning, preserving general capabilities while excelling on finance queries.
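One simple way to realize such a mix is to sample each training example from the domain corpus with a fixed probability and from the generic corpus otherwise. The sketch below assumes this scheme; the function name, the `domain_ratio` parameter, and the 0.5 value are illustrative choices, not details reported for XuanYuan.

```python
import random

def mix_corpora(domain_docs, general_docs, domain_ratio, n_samples, seed=0):
    """Build a training stream in which about `domain_ratio` of items are domain docs."""
    if not 0.0 <= domain_ratio <= 1.0:
        raise ValueError("domain_ratio must be between 0 and 1")
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        # Pick the source corpus first, then a document from it.
        source = domain_docs if rng.random() < domain_ratio else general_docs
        mixed.append(rng.choice(source))
    return mixed

# Hypothetical toy corpora.
stream = mix_corpora(["bond yield note"], ["general news text"], 0.5, 1000)
print(stream.count("bond yield note"))
```

In practice the ratio is a tuning knob: too little generic text risks forgetting, too little domain text weakens specialization.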
Training From Scratch
Training a model from scratch is possible (e.g., BloombergGPT) but still requires a large proportion of generic text to learn basic language and world knowledge. Roughly half of BloombergGPT’s training set (about 49 %) is generic data.
Domain Fine‑Tuning Data Construction
High‑quality, large‑scale instruction data are essential. Three automated generation methods are commonly used:
Self‑Instruct: Start from a small set of seed instructions (roughly 100–200) and use GPT‑4 to expand them into thousands of new instruction‑input‑output triples.
Self‑QA: Generate instructions directly from unstructured documents, then let GPT‑4 answer them.
Self‑KG: Sample triples from a high‑quality knowledge graph and prompt GPT‑4 to create corresponding instructions.
Self‑Instruct Workflow
Select a seed instruction.
Prompt GPT‑4 to generate similar instructions.
Classify each instruction as a classification or generation task.
For classification tasks, use an “output‑first” strategy: generate the label, then craft an input that matches the label.
For generation tasks, use an “input‑first” strategy: generate the input sentence, then produce the output.
Filter low‑quality or duplicate entries and iterate.
In the reported experiments, 175 seed instructions yielded more than 82,000 samples, and 92 % of the generated instructions were judged meaningful.
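The workflow above can be sketched as a single bootstrapping round. This is a minimal illustration, not a reference implementation: the three callables stand in for LLM calls, the `fake_*` functions are toy stubs, and the `difflib` similarity check with a 0.7 threshold is an assumed stand-in for whatever duplicate filter is used in practice.

```python
from difflib import SequenceMatcher

def is_near_duplicate(candidate, pool, threshold=0.7):
    """True if `candidate` is too similar to any instruction already in `pool`."""
    return any(
        SequenceMatcher(None, candidate.lower(), p.lower()).ratio() >= threshold
        for p in pool
    )

def self_instruct_round(pool, generate_similar, is_classification, make_instance):
    """Generate candidates, filter duplicates, then build one instance per task."""
    accepted = []
    for candidate in generate_similar(pool):
        candidate = candidate.strip()
        seen = pool + [a["instruction"] for a in accepted]
        if not candidate or is_near_duplicate(candidate, seen):
            continue
        # Classification tasks: label first. Generation tasks: input first.
        strategy = "output_first" if is_classification(candidate) else "input_first"
        accepted.append({"instruction": candidate, "strategy": strategy,
                         **make_instance(candidate, strategy)})
    return accepted

# Toy stand-ins for the LLM calls (hypothetical).
def fake_generate(pool):
    return ["Classify the sentiment of a product review.",
            "Classify the sentiment of a product review!",  # near-duplicate
            "Write a SQL query that counts orders per customer."]

def fake_is_classification(instruction):
    return instruction.lower().startswith("classify")

def fake_make_instance(instruction, strategy):
    if strategy == "output_first":
        return {"output": "positive", "input": "<text matching the label>"}
    return {"input": "<generated input>", "output": "<generated output>"}

seeds = ["Translate the sentence into French."]
new_tasks = self_instruct_round(seeds, fake_generate,
                                fake_is_classification, fake_make_instance)
print([t["strategy"] for t in new_tasks])
```

Accepted tasks would be appended to the pool before the next round, which is what makes the process iterative.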
Self‑QA Workflow
When seed instructions are unavailable, GPT‑4 first creates plausible instructions from a document, then answers them, forming instruction‑input‑output triples. Heuristic filtering improves quality.
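A rough sketch of this two-stage flow follows. The prompt wording, the `llm` callable, and the two heuristic filters (the instruction must end with a question mark, the answer must have a minimum length) are all assumptions made for illustration.

```python
def build_self_qa_examples(document, llm, max_questions=3, min_answer_chars=10):
    """Stage 1: ask for questions about the document. Stage 2: answer each one."""
    q_prompt = (
        f"Read the document and write up to {max_questions} questions "
        "that it can answer, one per line.\n\nDocument:\n" + document
    )
    questions = [q.strip() for q in llm(q_prompt).splitlines() if q.strip()]
    examples = []
    for question in questions[:max_questions]:
        a_prompt = ("Answer using only the document.\n\nDocument:\n"
                    f"{document}\n\nQuestion: {question}")
        answer = llm(a_prompt).strip()
        # Heuristic filters (assumed): drop non-questions and very short answers.
        if not question.endswith("?") or len(answer) < min_answer_chars:
            continue
        examples.append({"instruction": question, "input": "", "output": answer})
    return examples

# Toy LLM stub (hypothetical) so the sketch runs without a model.
def toy_llm(prompt):
    if prompt.startswith("Read the document"):
        return "What does an index speed up?\nList something\nWhat is a primary key?"
    if "index" in prompt.split("Question:")[-1]:
        return "An index speeds up lookups on the indexed column."
    return "N/A"

doc = "An index speeds up lookups. A primary key uniquely identifies a row."
examples = build_self_qa_examples(doc, toy_llm)
print(examples)
```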
Self‑KG Workflow
Given a knowledge graph, sample a triple (entity‑relation‑entity) and prompt GPT‑4 to generate an instruction that requires reasoning over that triple, producing domain‑specific fine‑tuning data.
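The sampling and prompting step might look like the sketch below. The prompt template and the example triples are illustrative assumptions; a real system would send each prompt to the LLM and store the returned instruction together with its answer.

```python
import random

def triple_to_prompt(triple):
    """Turn one (head, relation, tail) triple into an instruction-writing prompt."""
    head, relation, tail = triple
    return (
        "Write a question whose answer requires the fact in this triple, "
        "then answer it.\n"
        f"Triple: ({head}, {relation}, {tail})"
    )

def sample_kg_prompts(triples, k, seed=0):
    """Sample k distinct triples and build one prompt per triple."""
    rng = random.Random(seed)
    return [triple_to_prompt(t) for t in rng.sample(triples, k)]

# Hypothetical toy knowledge graph.
TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "interacts_with", "warfarin"),
    ("insulin", "is_a", "hormone"),
]

prompts = sample_kg_prompts(TRIPLES, 2)
print(prompts[0])
```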
Hallucination Mitigation
Generating citations during answer generation improves factual consistency. Users can quickly verify answers by checking the provided references.
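One lightweight way to get checkable citations is to number the retrieved passages in the prompt and then verify that every `[n]` marker in the answer points at a passage that exists. The prompt wording and the `[n]` convention are assumptions for this sketch.

```python
import re

def build_cited_prompt(question, passages):
    """Number the passages so the model can cite them as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using the passages. "
        "Cite the supporting passage as [n] after each claim.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

def invalid_citations(answer, n_passages):
    """Return cited passage numbers that do not correspond to any passage."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(c for c in cited if not 1 <= c <= n_passages)

passages = ["Beijing is the capital of China.", "The Forbidden City is in Beijing."]
prompt = build_cited_prompt("What is the capital of China?", passages)
print(invalid_citations("Beijing [1]. It has a subway [3].", len(passages)))
```

A non-empty result signals a citation the reader could not verify, which is a cheap first check before any deeper consistency test.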
Factual Consistency Evaluation
This task is cast as Natural Language Inference (NLI): given a premise (retrieved document) and a hypothesis (model answer), a classifier predicts entailment, neutral, or contradiction. Datasets such as Adversarial NLI can be used to train evaluators (e.g., T5) that detect hallucinations.
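The evaluation loop can be sketched independently of the classifier. Here `nli_predict` is a pluggable callable returning one of the three labels; `toy_nli` is a deliberately crude word-overlap stub so the example runs, and is in no way a substitute for a trained NLI model.

```python
def flag_hallucinations(answer_sentences, evidence, nli_predict):
    """Flag answer sentences that no evidence passage entails."""
    flagged = []
    for sentence in answer_sentences:
        labels = [nli_predict(premise=doc, hypothesis=sentence) for doc in evidence]
        if "entailment" not in labels:
            worst = "contradiction" if "contradiction" in labels else "neutral"
            flagged.append((sentence, worst))
    return flagged

def toy_nli(premise, hypothesis):
    """Toy stand-in (assumption): set overlap instead of a real classifier."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    if h <= p:
        return "entailment"
    if "not" in (p ^ h):
        return "contradiction"
    return "neutral"

evidence = ["the capital of china is beijing"]
answer = [
    "the capital of china is beijing",
    "the capital of china is not beijing",
    "beijing hosts many museums",
]
print(flag_hallucinations(answer, evidence, toy_nli))
```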
Knowledge Retrieval
Effective retrieval is crucial for RAG. Two main dense retriever families are:
Dense Passage Retrieval (DPR): Dual‑tower architecture with separate encoders for queries and passages, trained with contrastive loss.
Generalizable T5‑based Retriever (GTR): Dual‑encoder retriever in which queries and passages share a single T5 encoder, offering higher recall at the cost of speed.
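To make "contrastive loss" concrete, the sketch below computes a softmax loss over dot-product scores with in-batch negatives: passage i is treated as the positive for query i, and every other passage in the batch as a negative. In-batch negatives are one common choice and an assumption here; the vectors are toy values rather than encoder outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of each query's positive passage."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

queries = [[5.0, 0.0], [0.0, 5.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives match their queries
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives point the wrong way

print(in_batch_contrastive_loss(queries, aligned))
print(in_batch_contrastive_loss(queries, swapped))
```

Training pushes the loss toward the aligned case, so that a query and its relevant passage end up close in the shared embedding space.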
Keyword‑Generation LLM
In specialized domains, user queries are often colloquial while documents are technical. Models such as ChatDoctor first generate domain‑relevant keywords from the query, then retrieve documents using those keywords, improving relevance.
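A minimal sketch of the keyword-then-retrieve idea: a generator maps the colloquial query to technical terms, and documents are ranked by how often those terms occur. `toy_keywords` is a hypothetical stub for the keyword-generation LLM, and counting substring matches is an assumed scoring rule chosen for brevity.

```python
def keyword_retrieve(query, documents, generate_keywords, top_k=3):
    """Rank documents by occurrences of keywords generated from the query."""
    keywords = [k.lower() for k in generate_keywords(query)]
    scored = []
    for doc in documents:
        text = doc.lower()
        score = sum(text.count(k) for k in keywords)
        if score:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_keywords(query):
    """Stub for the keyword LLM (assumption): maps lay wording to technical terms."""
    if "head" in query.lower():
        return ["headache", "migraine"]
    return query.lower().split()

docs = [
    "Migraine is a recurrent headache disorder.",
    "Tension headache often responds to rest.",
    "Ankle sprains need ice and elevation.",
]
hits = keyword_retrieve("my head hurts a lot", docs, toy_keywords, top_k=2)
print(hits)
```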
Context Rewriting for Multi‑Turn Dialogues
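User: "What is the capital of China?"
Bot: "The capital of China is Beijing."
User: "What are the attractions there?"
The last turn lacks context; a rewriting step transforms it into a standalone query, "What tourist attractions are there in Beijing?", before retrieval.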
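The rewriting step is usually just another LLM call with the dialogue history in the prompt. The sketch below assumes that setup; the prompt wording and the `llm` callable are illustrative, and `toy_llm` returns a canned answer so the example runs.

```python
def build_rewrite_prompt(history, last_user_turn):
    """Show the dialogue and ask for a self-contained version of the last turn."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    return (
        "Rewrite the final user message as a standalone question that can be "
        "understood without the conversation.\n\n"
        + "\n".join(lines)
        + f"\nUser: {last_user_turn}\n\nStandalone question:"
    )

def rewrite_query(history, last_user_turn, llm):
    """Skip rewriting on the first turn; otherwise delegate to the LLM."""
    if not history:
        return last_user_turn
    return llm(build_rewrite_prompt(history, last_user_turn)).strip()

def toy_llm(prompt):
    # Canned output standing in for a real model (assumption).
    return " What tourist attractions are there in Beijing? "

history = [
    ("User", "What is the capital of China?"),
    ("Bot", "The capital of China is Beijing."),
]
print(rewrite_query(history, "What are the attractions there?", toy_llm))
```

The rewritten query, not the raw last turn, is what gets sent to the retriever.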
Document Selection and Ranking
Empirical results suggest retrieving 3–4 documents balances recall and precision. Further re‑ranking can be performed by prompting an LLM to select the most helpful passages (e.g., ChatDoctor’s re‑ranking step).
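A two-stage selection can be sketched as follows: keep the top few documents by retriever score, then reorder them with a second scorer. Here `word_overlap` is a toy stand-in for the LLM-based re-ranker, and the cutoffs `k_retrieve=4` and `k_final=2` are illustrative defaults.

```python
def retrieve_then_rerank(query, scored_docs, rerank_score, k_retrieve=4, k_final=2):
    """Keep the top-k by retriever score, then reorder them with a second scorer."""
    candidates = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k_retrieve]
    reranked = sorted(candidates, key=lambda pair: rerank_score(query, pair[0]),
                      reverse=True)
    return [doc for doc, _ in reranked[:k_final]]

def word_overlap(query, doc):
    """Toy re-ranker (assumption): number of words shared with the query."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# (document, retriever score) pairs with hypothetical scores.
scored = [
    ("interest rates and bond prices", 0.91),
    ("bond yields explained", 0.88),
    ("why bond prices react to interest rates", 0.80),
    ("stock splits", 0.75),
    ("bond prices", 0.10),
]
top = retrieve_then_rerank(
    "why do bond prices fall when interest rates rise", scored, word_overlap
)
print(top)
```

Note that the re-ranker only sees what the first stage kept, so a document cut at the retrieval step cannot be recovered later.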
System Architecture
Vertical LLMs are best built as end‑to‑end pipelines with three modules:
Query understanding / keyword generation.
Knowledge retrieval (dense or keyword‑based).
Reasoning and answer generation (often RAG).
Examples include Alibaba Cloud’s architecture (query parsing → retrieval → reasoning) and ChatLaw’s design (keyword‑generation LLM, retrieval component, domain‑specific LLM).
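The three-module layout can be expressed as a small pipeline object with one pluggable callable per stage. This is a structural sketch only: the class name, the lambdas, and the fallback message returned when retrieval finds nothing are assumptions, not the design of any system named above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerticalQAPipeline:
    understand: Callable[[str], List[str]]      # query -> keywords
    retrieve: Callable[[List[str]], List[str]]  # keywords -> passages
    generate: Callable[[str, List[str]], str]   # query + passages -> answer

    def answer(self, query: str) -> str:
        keywords = self.understand(query)
        passages = self.retrieve(keywords)
        if not passages:
            # Assumed behavior: decline instead of answering without evidence.
            return "I could not find supporting material for this question."
        return self.generate(query, passages)

# Toy components (hypothetical) so the pipeline runs end to end.
DOCS = ["A contract requires offer, acceptance, and consideration."]

pipeline = VerticalQAPipeline(
    understand=lambda q: [w.strip("?.,").lower() for w in q.split() if len(w) > 4],
    retrieve=lambda kws: [d for d in DOCS if any(k in d.lower() for k in kws)],
    generate=lambda q, ps: f"Based on {len(ps)} passage(s): {ps[0]}",
)
print(pipeline.answer("What makes a contract valid?"))
```

Keeping the stages behind plain callables makes it easy to swap a keyword retriever for a dense one, or a stub generator for a domain LLM, without touching the control flow.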
Open Challenges
Graceful refusal for out‑of‑domain queries.
Effective chunking of long documents for encoding.
Accelerating copy‑heavy answer generation.
Integrating domain‑specific tools (e.g., simulators, calculators) with LLMs.
Handling multiple questions in a single turn.
Balancing hallucination reduction with the model’s ability to generalize.
Addressing these challenges will further close the gap between research prototypes and production‑ready vertical LLM applications.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
