
Building High‑Performance Vertical Domain LLMs: From Continued Pre‑Training to Retrieval‑Augmented Generation

This article systematically explains how to create vertical domain large language models by continuing pre‑training on domain data, constructing fine‑tuning datasets with self‑instruct, reducing hallucinations, and integrating knowledge retrieval, while also reviewing related papers, products, and system architectures.

Baobao Algorithm Notes

Nov 9, 2023


Background

General‑purpose large language models (LLMs) can perform many tasks but often do so sub‑optimally. In contrast, domain‑specific (vertical) LLMs focus on a narrow set of tasks, achieve higher accuracy, and are more trustworthy for productivity‑critical applications such as generating correct SQL statements.

Typical Vertical‑LLM Development Pipeline

Continued Pre‑Training: Train a base model further on domain‑specific corpora to inject specialized terminology and knowledge.

Supervised Fine‑Tuning (SFT): Align the model with domain tasks and desired answer styles.

Reinforcement Learning from Human Feedback (RLHF): Refine responses to match professional tone and user preferences.

Vertical models usually employ Retrieval‑Augmented Generation (RAG): they first retrieve relevant knowledge and then generate an answer grounded in it. This reduces hallucinations, keeps answers current, and lets operators correct wrong answers quickly by updating the knowledge base instead of retraining the model.

Continued Pre‑Training

Continuing pre‑training on domain data allows the model to learn specialized vocabularies. For example, the scientific LLM Mozi was further trained on a 4 B‑token scientific corpus, reducing perplexity from 6.95 to 3.46 and improving downstream task scores from 0.38 to 0.52.
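To make the perplexity figures concrete: perplexity is the exponential of the average per-token negative log-likelihood, so a lower value means the model finds domain text less surprising. The sketch below is only an illustration; the function name and the token probabilities are made up and are not taken from the Mozi work.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability of the observed tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Made-up probabilities that a model assigns to the same four domain tokens
# before and after continued pre-training (illustrative values only).
before = [math.log(p) for p in (0.10, 0.20, 0.15, 0.12)]
after = [math.log(p) for p in (0.30, 0.35, 0.25, 0.28)]

print("before:", round(perplexity(before), 2))
print("after: ", round(perplexity(after), 2))
```

A model that assigned probability 0.5 to every token would have a perplexity of exactly 2, which is a handy sanity check for any implementation.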

Mixed‑Domain Data

To avoid catastrophic forgetting, mix generic text with domain data. The financial model XuanYuan (based on Bloom) used a hybrid‑tuning strategy that combined both data types during pre‑training and instruction tuning, preserving general capabilities while excelling on finance queries.
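One simple way to picture data mixing is a sampler that draws each training document from the domain corpus with a fixed probability and from the generic corpus otherwise. This is a minimal sketch under that assumption; `mix_corpora`, the 0.5 ratio, and the toy documents are hypothetical and do not reproduce XuanYuan's actual recipe.

```python
import random

def mix_corpora(domain_docs, general_docs, domain_ratio, n_samples, seed=0):
    """Draw n_samples documents; each comes from the domain corpus
    with probability domain_ratio, otherwise from the general corpus."""
    if not 0.0 <= domain_ratio <= 1.0:
        raise ValueError("domain_ratio must be in [0, 1]")
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = domain_docs if rng.random() < domain_ratio else general_docs
        mixed.append(rng.choice(source))
    return mixed

domain = ["bond yield report", "earnings call transcript"]
general = ["encyclopedia article", "news story", "forum post"]

# The ratio is a tunable assumption, not a recommended value.
batch = mix_corpora(domain, general, domain_ratio=0.5, n_samples=8)
print(batch)
```

In practice the ratio is a hyperparameter: too little generic text risks forgetting, too little domain text weakens specialization.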

Training From Scratch

Training a model from scratch is possible (e.g., BloombergGPT), but it still requires a large proportion of generic text so that the model learns basic language and world knowledge. Just under half of BloombergGPT’s training tokens (about 49 %) come from general‑purpose public datasets.

Domain Fine‑Tuning Data Construction

High‑quality, large‑scale instruction data are essential. Three automated generation methods are commonly used:

Self‑Instruct: Start from a small set of human‑written seed instructions (175 in the original paper) and use GPT‑4 to expand them into thousands of new instruction‑input‑output triples.

Self‑QA: Generate instructions directly from unstructured documents, then let GPT‑4 answer them.

Self‑KG: Sample triples from a high‑quality knowledge graph and prompt GPT‑4 to create corresponding instructions.

Self‑Instruct Workflow

Select a seed instruction.

Prompt GPT‑4 to generate similar instructions.

Classify each instruction as a classification or generation task.

For classification tasks, use an “output‑first” strategy: generate the label, then craft an input that matches the label.

For generation tasks, use an “input‑first” strategy: generate the input sentence, then produce the output.

Filter low‑quality or duplicate entries and iterate.

In the original Self‑Instruct experiments, 175 seed tasks yielded roughly 52,000 instructions and more than 82,000 instances; in a manual review of a sample, 92 % of the generated instructions described a valid task.
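The generate-filter-iterate loop of the workflow above can be sketched as follows. This is a simplified illustration, not the reference implementation: `generate` is a hypothetical callable standing in for the LLM call, `difflib` string similarity stands in for whatever overlap metric a real pipeline uses, and the classification step with its output-first and input-first strategies is omitted.

```python
import random
from difflib import SequenceMatcher

def is_novel(candidate, pool, max_similarity=0.7):
    """Reject candidates that are near-duplicates of anything already in the pool."""
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < max_similarity
        for existing in pool
    )

def self_instruct(seeds, generate, rounds=3, shots=2, seed=0):
    """Grow an instruction pool: sample in-context examples, generate, filter, repeat."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        examples = rng.sample(pool, min(shots, len(pool)))
        for candidate in generate(examples):
            candidate = candidate.strip()
            if candidate and is_novel(candidate, pool):
                pool.append(candidate)
    return pool

# Stand-in for the LLM; a real version would prompt a model with the examples.
def fake_generate(examples):
    return [examples[0], "Summarize the quarterly earnings report.", ""]

pool = self_instruct(["Translate the sentence into French."], fake_generate)
print(pool)
```

Because newly accepted instructions are added back to the pool, later rounds can sample them as in-context examples, which is what lets a small seed set grow.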

Self‑QA Workflow

When seed instructions are unavailable, GPT‑4 first creates plausible instructions from a document, then answers them, forming instruction‑input‑output triples. Heuristic filtering improves quality.
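A rough sketch of this two-stage process is shown below. The prompts, the `llm` callable, and the specific filter (dropping answers that refer back to "the text") are assumptions chosen for illustration; a real pipeline would tune all three.

```python
def self_qa(document, llm, n_questions=3):
    """Stage 1: ask for questions about the document. Stage 2: answer each one."""
    question_prompt = (
        f"Read the text below and write {n_questions} questions it can answer, "
        f"one per line.\n\nText:\n{document}"
    )
    questions = [q.strip() for q in llm(question_prompt).splitlines() if q.strip()]

    samples = []
    for question in questions[:n_questions]:
        answer_prompt = (
            "Answer the question using only the text.\n\n"
            f"Text:\n{document}\n\nQuestion: {question}"
        )
        answer = llm(answer_prompt).strip()
        # Heuristic filter: drop empty answers and answers that point back at
        # "the text", since the document is absent at fine-tuning time.
        if answer and "the text" not in answer.lower():
            samples.append({"instruction": question, "input": "", "output": answer})
    return samples

# Stand-in for the LLM, with canned replies for demonstration.
def fake_llm(prompt):
    if "Question:" not in prompt:
        return "What is the capital of China?\nWhich city is mentioned?\n"
    if "Which city" in prompt:
        return "As the text says, Beijing."
    return "Beijing."

samples = self_qa("Beijing is the capital of China.", fake_llm)
print(samples)
```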

Self‑KG Workflow

Given a knowledge graph, sample a triple (entity‑relation‑entity) and prompt GPT‑4 to generate an instruction that requires reasoning over that triple, producing domain‑specific fine‑tuning data.
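As a minimal sketch, sampling triples and turning them into generation prompts might look like this. The prompt wording, the function names, and the toy medical triples are all hypothetical.

```python
import random

def triple_to_prompt(triple):
    """Format one (head, relation, tail) triple as a data-generation prompt."""
    head, relation, tail = triple
    return (
        "Write one question whose answer requires knowing the following fact, "
        "then answer it.\n"
        f"Fact: ({head}, {relation}, {tail})"
    )

def self_kg_prompts(triples, n, seed=0):
    """Sample up to n distinct triples and build one prompt per triple."""
    rng = random.Random(seed)
    sampled = rng.sample(triples, min(n, len(triples)))
    return [triple_to_prompt(t) for t in sampled]

kg = [
    ("aspirin", "treats", "headache"),
    ("metformin", "treats", "type 2 diabetes"),
    ("ibuprofen", "is_a", "NSAID"),
]
prompts = self_kg_prompts(kg, n=2)
for p in prompts:
    print(p, end="\n\n")
```

Each prompt would then be sent to the generator model, and the returned question-answer pair stored as a fine-tuning sample.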

Hallucination Mitigation

Generating citations during answer generation improves factual consistency. Users can quickly verify answers by checking the provided references.

Factual Consistency Evaluation

This task is cast as Natural Language Inference (NLI): given a premise (retrieved document) and a hypothesis (model answer), a classifier predicts entailment, neutral, or contradiction. Datasets such as Adversarial NLI can be used to train evaluators (e.g., T5) that detect hallucinations.
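The decision logic around such an evaluator can be sketched as below. Here `nli` is a hypothetical callable that returns one of the three labels; `toy_nli` is only a placeholder based on substring matching, where a real system would call a trained NLI model.

```python
def check_answer(answer_sentences, documents, nli):
    """Label each answer sentence by the strongest verdict over all retrieved documents."""
    report = []
    for sentence in answer_sentences:
        labels = {nli(premise=doc, hypothesis=sentence) for doc in documents}
        if "entailment" in labels:
            verdict = "supported"
        elif "contradiction" in labels:
            verdict = "contradicted"
        else:
            verdict = "unverified"  # possible hallucination: no document backs it
        report.append((sentence, verdict))
    return report

# Toy stand-in: exact containment counts as entailment, everything else as neutral.
def toy_nli(premise, hypothesis):
    if hypothesis.lower() in premise.lower():
        return "entailment"
    return "neutral"

docs = ["The capital of China is Beijing."]
answer = ["the capital of china is beijing", "Beijing has 40 million residents"]
report = check_answer(answer, docs, toy_nli)
print(report)
```

Sentences marked "unverified" or "contradicted" can be removed, regenerated, or flagged to the user, depending on how strict the application needs to be.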

Knowledge Retrieval

Effective retrieval is crucial for RAG. Two main dense retriever families are:

Dense Passage Retrieval (DPR): Dual‑tower architecture with separate encoders for queries and passages, trained with a contrastive loss.

Generalizable T5‑based Retriever (GTR): A dual encoder whose query and passage towers share a single T5 encoder; scaling up the encoder improves out‑of‑domain generalization at the cost of slower encoding.
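To illustrate the contrastive objective used to train dual encoders, the sketch below scores each query against every passage in a batch by dot product and penalizes the model when a query's own passage does not score highest. The toy two-dimensional vectors stand in for encoder outputs; the function names are hypothetical and the code uses plain Python rather than a deep-learning framework.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of each query's positive passage.
    passage_vecs[i] is the positive for query_vecs[i]; the other passages
    in the batch serve as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        max_score = max(scores)  # subtract the max for numerical stability
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[2.0, 0.0], [0.0, 2.0]])
swapped = in_batch_contrastive_loss(queries, [[0.0, 2.0], [2.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each query embedding points toward its own passage and large when the pairing is wrong, which is the signal that pulls matching pairs together during training.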

Keyword‑Generation LLM

In specialized domains, user queries are often colloquial while documents are technical. Models such as ChatDoctor first generate domain‑relevant keywords from the query, then retrieve documents using those keywords, improving relevance.
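The keyword-then-retrieve idea can be sketched as follows. `extract_keywords` is a hypothetical callable standing in for the keyword-generation LLM, and the simple substring-count ranking is an assumption for illustration, not ChatDoctor's actual retriever.

```python
def keyword_retrieve(query, documents, extract_keywords, top_k=3):
    """Rank documents by how many LLM-generated keywords they contain."""
    keywords = [k.lower() for k in extract_keywords(query)]
    scored = []
    for doc in documents:
        text = doc.lower()
        hits = sum(1 for k in keywords if k in text)
        if hits:
            scored.append((hits, doc))
    # Sort on the hit count only; ties keep their original document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Stand-in for the LLM: maps a colloquial complaint to technical terms.
def fake_extract(query):
    return ["gastroenteritis", "abdominal pain"]

docs = [
    "Gastroenteritis commonly causes abdominal pain and diarrhea.",
    "Migraine is a recurrent headache disorder.",
    "Abdominal pain has many causes.",
]
results = keyword_retrieve("my tummy hurts and I feel sick", docs, fake_extract)
print(results)
```

Note that the colloquial query shares almost no words with the technical documents; the generated keywords are what bridge the vocabulary gap.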

Context Rewriting for Multi‑Turn Dialogues

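User: "What is the capital of China?"
Bot: "The capital of China is Beijing."
User: "What attractions are there?"

The last turn lacks context on its own; before retrieval, a rewriting step transforms it into the standalone query "What tourist attractions are there in Beijing?".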
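A rewriting step of this kind can be sketched as a prompt built from the dialogue history plus a fallback when the model returns nothing. The prompt wording and the `llm` callable are assumptions; the example uses the dialogue above, rendered in English.

```python
def build_rewrite_prompt(history, last_user_turn):
    """Assemble the conversation and ask for a self-contained version of the last turn."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    return (
        "Rewrite the final user message as a standalone question, "
        "resolving pronouns and omitted entities from the conversation.\n\n"
        + "\n".join(lines)
        + f"\nUser: {last_user_turn}\n\nStandalone question:"
    )

def rewrite_query(history, last_user_turn, llm):
    # With no history there is nothing to resolve, so skip the extra model call.
    if not history:
        return last_user_turn
    rewritten = llm(build_rewrite_prompt(history, last_user_turn)).strip()
    return rewritten or last_user_turn  # fall back if the model returns nothing

history = [
    ("User", "What is the capital of China?"),
    ("Bot", "The capital of China is Beijing."),
]
# Stand-in for the LLM, returning a canned rewrite.
fake_llm = lambda prompt: " What tourist attractions are there in Beijing? "
query = rewrite_query(history, "What attractions are there?", fake_llm)
print(query)
```

The rewritten query, not the raw last turn, is what gets passed to the retriever.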

Document Selection and Ranking

Empirical results suggest retrieving 3–4 documents balances recall and precision. Further re‑ranking can be performed by prompting an LLM to select the most helpful passages (e.g., ChatDoctor’s re‑ranking step).
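Selection and re-ranking are often organized as two stages: keep the top candidates by retrieval score, then reorder them with a more expensive scorer and pass only a few to the generator. The sketch below assumes that structure; `rerank_score` is a hypothetical callable (an LLM or cross-encoder in practice), and the word-overlap scorer is only a toy stand-in.

```python
def select_passages(query, candidates, retrieval_scores, rerank_score,
                    retrieve_k=10, final_k=3):
    """Two-stage selection: cheap retrieval scores first, then a costlier re-ranker."""
    by_retrieval = sorted(
        zip(candidates, retrieval_scores), key=lambda pair: pair[1], reverse=True
    )[:retrieve_k]
    reranked = sorted(
        (passage for passage, _ in by_retrieval),
        key=lambda passage: rerank_score(query, passage),
        reverse=True,
    )
    return reranked[:final_k]

# Toy re-ranker: number of words shared by the query and the passage.
def overlap(query, passage):
    return len(set(query.split()) & set(passage.split()))

candidates = ["a b", "c d", "a c e", "z"]
scores = [0.9, 0.8, 0.7, 0.1]
top = select_passages("a c", candidates, scores, overlap, retrieve_k=3, final_k=2)
print(top)
```

Keeping `final_k` small, in line with the 3–4 documents mentioned above, limits prompt length and reduces the chance that irrelevant passages distract the generator.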

System Architecture

Vertical LLMs are best built as end‑to‑end pipelines with three modules:

Query understanding / keyword generation.

Knowledge retrieval (dense or keyword‑based).

Reasoning and answer generation (often RAG).

Examples include Alibaba Cloud’s architecture (query parsing → retrieval → reasoning) and ChatLaw’s design (keyword‑generation LLM, retrieval component, domain‑specific LLM).
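The three-module structure can be expressed as a small pipeline object that wires the stages together. This is a hypothetical sketch, not the architecture of any of the systems named above: the class name, the callables, and the refusal message are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerticalQAPipeline:
    understand: Callable[[str], str]           # question -> rewritten query or keywords
    retrieve: Callable[[str], List[str]]       # processed query -> passages
    generate: Callable[[str, List[str]], str]  # question + passages -> answer
    refusal: str = "Sorry, I could not find relevant material for this question."

    def answer(self, question: str) -> str:
        processed = self.understand(question)
        passages = self.retrieve(processed)
        if not passages:
            # Nothing retrieved: refuse rather than let the generator guess.
            return self.refusal
        return self.generate(question, passages)

# Toy stand-ins for the three modules.
knowledge = {"beijing": ["The Forbidden City is in Beijing."]}
pipeline = VerticalQAPipeline(
    understand=lambda q: "beijing" if "beijing" in q.lower() else q.lower(),
    retrieve=lambda key: knowledge.get(key, []),
    generate=lambda q, passages: f"Based on {len(passages)} source(s): {passages[0]}",
)
print(pipeline.answer("What can I visit in Beijing?"))
print(pipeline.answer("What is the weather on Mars?"))
```

Keeping the modules behind plain callables makes it easy to swap a dense retriever for a keyword-based one, or to replace the generator, without touching the rest of the pipeline.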

Open Challenges

Graceful refusal for out‑of‑domain queries.

Effective chunking of long documents for encoding.

Accelerating copy‑heavy answer generation.

Integrating domain‑specific tools (e.g., simulators, calculators) with LLMs.

Handling multiple questions in a single turn.

Balancing hallucination reduction with the model’s ability to generalize.

Addressing these challenges will further close the gap between research prototypes and production‑ready vertical LLM applications.


Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
