
Mastering LLM Training: From Tokenizer Design to Instruction Tuning

This article provides a comprehensive, step‑by‑step guide to building large language models, covering tokenizer creation, vocabulary expansion, pre‑training strategies, dataset cleaning, instruction‑tuning techniques, and evaluation metrics such as C‑Eval and GPT‑4 based scoring.

Baobao Algorithm Notes

Aug 21, 2023


1. Pretraining Stage

Many recent works (e.g., Alpaca, Vicuna) fine-tune a strong base model rather than pretraining from scratch, because the pre-trained model already contains much of the knowledge needed for downstream tasks. In practice, however, open-source backbones may lack Chinese support or domain-specific expertise, which calls for additional pretraining data or tokenizer adjustments.

1.1 Tokenizer Training

Before pretraining, a suitable base model must be chosen. Most high‑performing LLMs are trained primarily on English data, so Chinese‑language projects often perform a second‑stage pretraining on Chinese corpora. Tokenizer training involves two common algorithms: WordPiece and Byte‑Pair Encoding (BPE).

WordPiece stores frequently used characters and words in a vocabulary and looks them up during tokenization. For example, with the input 你好世界 ("hello world"):

Input sentence >>> 你好世界
Tokenization result >>> ['你', '好', '世', '界']
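
The lookup described above can be sketched as a greedy longest-match-first search, which is how WordPiece inference is usually described. This is a minimal illustration, not a full WordPiece implementation: the toy vocabularies and the `[UNK]` fallback are assumptions, and the `##` prefix that real WordPiece vocabularies use for word-internal pieces is omitted.

```python
def greedy_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first lookup against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry starts at this character.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical toy vocabulary containing single characters only.
char_vocab = {"你", "好", "世", "界"}
print(greedy_tokenize("你好世界", char_vocab))

# Adding a whole word lets the same lookup emit fewer tokens.
word_vocab = char_vocab | {"世界"}
print(greedy_tokenize("你好世界", word_vocab))
```

With the character-only vocabulary the sentence splits into four tokens, matching the example above; once 世界 is in the vocabulary it is matched as a single token.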

Byte-level BPE operates on UTF-8 bytes rather than on whole characters, which scales better to multilingual vocabularies. Most Chinese characters occupy three bytes in UTF-8, so a character that is not in the vocabulary can be split into as many as three byte tokens.
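
The three-byte claim is easy to check with Python's built-in UTF-8 encoder. The `<0xNN>` spelling below imitates the byte tokens used by SentencePiece-style byte fallback; the helper name `byte_tokens` is made up for this sketch.

```python
def byte_tokens(ch):
    """Render one character as the byte-level tokens a byte fallback would emit."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


for ch in ["a", "你", "好"]:
    # ASCII letters take one byte; common Chinese characters take three.
    print(ch, len(ch.encode("utf-8")), byte_tokens(ch))
```

Rare characters outside the Basic Multilingual Plane take four bytes, so three is the typical case rather than a hard rule.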

Vocabulary expansion addresses this by adding common Chinese characters and words to the tokenizer, which shortens token sequences and makes training easier. Chinese-LLaMA added 17,953 tokens (mostly Chinese characters and words) to the original LLaMA tokenizer, and BELLE followed a similar approach with a vocabulary of about 50,000 tokens.
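
To make the effect of vocabulary expansion concrete, the sketch below counts tokens before and after adding four Chinese characters to a toy vocabulary. It is a character-level simplification: both vocabularies and the function name are hypothetical, and real tokenizers also merge multi-character pieces.

```python
def tokenize_with_byte_fallback(text, vocab):
    """Emit a character if it is in the vocabulary, otherwise its UTF-8 bytes."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Hypothetical English-only vocabulary and a hypothetical expansion of it.
base_vocab = set("abcdefghijklmnopqrstuvwxyz ")
expanded_vocab = base_vocab | set("你好世界")

text = "你好世界"
before = tokenize_with_byte_fallback(text, base_vocab)
after = tokenize_with_byte_fallback(text, expanded_vocab)
print(len(before), len(after))  # 12 byte tokens vs. 4 character tokens
```

Shorter sequences are only half of the picture: newly added tokens start without trained embeddings, which is one reason expansion is usually paired with further pretraining on Chinese text.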

1.2 Language Model Pretraining

Pretraining is framed as next-token prediction on large text corpora. Data-source sampling, as in GPT-3, balances massive datasets against smaller, higher-quality ones: Common Crawl accounts for 60% of the training mix yet is seen for less than one epoch, while a small source such as Wikipedia is repeated for several epochs. Data preprocessing splits long documents into fixed-length chunks (e.g., seq_len = 2048) rather than keeping only the first seq_len tokens and discarding the rest.
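
Both ideas in this paragraph come down to a few lines of arithmetic and slicing. The sketch below is illustrative only: the corpus sizes and weights are hypothetical round numbers, not the GPT-3 figures, and the function names are invented for this example.

```python
def epochs_seen(weight, corpus_tokens, total_training_tokens):
    """Number of passes over a corpus implied by its sampling weight."""
    return weight * total_training_tokens / corpus_tokens


def chunk_token_ids(token_ids, seq_len):
    """Split one long token sequence into consecutive fixed-length chunks.

    The final chunk may be shorter than seq_len.
    """
    return [token_ids[i:i + seq_len] for i in range(0, len(token_ids), seq_len)]


# Hypothetical sizes, in billions of tokens.
total = 300
print(epochs_seen(0.60, 400, total))  # large web corpus: under one epoch
print(epochs_seen(0.03, 3, total))    # small curated corpus: several epochs

doc = list(range(5000))  # stand-in for a tokenized long document
chunks = chunk_token_ids(doc, seq_len=2048)
print([len(c) for c in chunks])  # [2048, 2048, 904]
```

Chunking keeps all 5,000 tokens available for training, whereas head truncation would keep only the first 2,048.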

Model‑speed tricks focus on attention computation (MQA, FlashAttention) and positional embeddings (ALiBi, RoPE) to improve training efficiency and generalization across sequence lengths.
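
As one concrete instance of the positional schemes named above, here is a minimal pure-Python sketch of rotary position embeddings (RoPE). It assumes the interleaved pairing of dimensions and the commonly used base of 10000; some implementations pair dimensions differently, and the vectors are arbitrary illustrative values.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.1, -0.4, 0.7, 0.2]
k = [0.5, 0.3, -0.2, 0.9]

# Same relative offset (3) at two different absolute positions.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 105), rope(k, 102))
print(abs(s1 - s2) < 1e-9)
```

The check shows the property that makes RoPE attractive: the attention score between a rotated query and key depends only on their relative offset, not on their absolute positions.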

1.3 Dataset Cleaning

Chinese pretraining data often draws from sources like WuDao (predominantly encyclopedia and blog content). Open‑source datasets can be used for experimentation, but high‑performance models typically require custom data pipelines. The Falcon paper reports that cleaned internet data can outperform carefully curated datasets.
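
A custom data pipeline usually starts with simple normalization, filtering, and deduplication. The sketch below is a deliberately minimal assumption of what such a first pass might look like; it is not the Falcon pipeline, and the `min_chars` threshold is an arbitrary illustrative value.

```python
import hashlib
import re


def clean_corpus(docs, min_chars=20):
    """Normalize whitespace, drop very short documents, remove exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text) < min_chars:
            continue  # likely navigation residue or an empty page
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept


raw = [
    "Large language models are trained on web text.",
    "Large   language models are trained\non web text.",  # duplicate once normalized
    "Home | Login",                                        # too short to keep
    "Deduplication reduces memorization of repeated passages.",
]
print(clean_corpus(raw))
```

Production pipelines typically go further, with steps such as language identification, near-duplicate detection, and quality heuristics, but the overall shape (normalize, filter, deduplicate) stays the same.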

2. Instruction Tuning Stage

After pretraining, models excel at continuation but often fail to produce direct answers. Instruction tuning aligns the model with human‑like dialogue by providing paired "instruction‑input‑output" examples.

2.1 Self‑Instruction

Strong existing models can be used to generate synthetic instruction data, an approach known as Self-Instruct. Stanford Alpaca started from 175 human-written seed tasks and prompted OpenAI's text-davinci-003 to produce new instructions together with their answers, while Chinese projects like BELLE applied the same idea with ChatGPT to create large-scale instruction datasets.
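
The core of this approach is a few-shot prompt built from sampled seed tasks. The sketch below only assembles such a prompt and makes no API call; the prompt wording, the `<noinput>` placeholder, the seed tasks, and the function name are all assumptions for illustration, not the prompts used by Alpaca or BELLE.

```python
import random


def build_generation_prompt(seed_tasks, num_examples=2, num_new=5, seed=0):
    """Assemble a few-shot prompt asking a model to write new tasks."""
    rng = random.Random(seed)
    examples = rng.sample(seed_tasks, k=min(num_examples, len(seed_tasks)))
    lines = [
        f"Below are {len(examples)} example tasks. "
        f"Write {num_new} new, diverse tasks in the same format."
    ]
    for i, task in enumerate(examples, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        lines.append(f"   Input: {task['input'] or '<noinput>'}")
        lines.append(f"   Output: {task['output']}")
    # Leave the next item open so the model continues the list.
    lines.append(f"{len(examples) + 1}. Instruction:")
    return "\n".join(lines)


# Hypothetical seed tasks.
seed_tasks = [
    {"instruction": "Translate the sentence into French.",
     "input": "Good morning.", "output": "Bonjour."},
    {"instruction": "Name three primary colors.",
     "input": "", "output": "Red, yellow, and blue."},
    {"instruction": "Summarize the text in one sentence.",
     "input": "Cats sleep most of the day and hunt at night.",
     "output": "Cats are mostly nocturnal hunters."},
]
prompt = build_generation_prompt(seed_tasks)
print(prompt)
```

The model's completion would then be parsed back into instruction, input, and output fields, and near-duplicates of existing tasks filtered out before the new tasks join the pool.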

2.2 Open‑source Instruction Datasets

Alpaca contains 52,000 instruction-response pairs. A typical JSON entry looks like:

{
    "instruction": "Arrange the words in the given sentence to form a grammatically correct sentence.",
    "input": "quickly the brown fox jumped",
    "output": "The quick brown fox jumped quickly."
}

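BELLE releases Chinese instruction data at multiple scales (1M, 2M, 3.5M). In the example entry below, the instruction asks the model to judge whether a given passage is grammatical and to suggest corrections if it is not, and the output reads "This opening is grammatically correct":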

{
    "instruction": "判断给定的文章是否符合语法规则。如果不符合，请提供修改建议。",
    "input": "",
    "output": "这个开头符合语法规则。"
}
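
Entries in this format are typically rendered into a single prompt string before training. The sketch below uses templates modeled on the ones published with Stanford Alpaca; treat the exact wording as an assumption, and note that `build_example` is a hypothetical helper name.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(entry):
    """Return (prompt, target) for one instruction-input-output entry."""
    template = PROMPT_WITH_INPUT if entry.get("input") else PROMPT_NO_INPUT
    return template.format(**entry), entry["output"]


alpaca_entry = {
    "instruction": "Arrange the words in the given sentence to form a "
                   "grammatically correct sentence.",
    "input": "quickly the brown fox jumped",
    "output": "The quick brown fox jumped quickly.",
}
belle_entry = {
    "instruction": "判断给定的文章是否符合语法规则。如果不符合，请提供修改建议。",
    "input": "",
    "output": "这个开头符合语法规则。",
}

for entry in (alpaca_entry, belle_entry):
    prompt, target = build_example(entry)
    print(prompt + target)
    print("-" * 40)
```

An empty `input` field selects the shorter template, so the model never sees a blank "Input" section. Many fine-tuning recipes then compute the loss only on the target tokens.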

2.3 Evaluation Methods

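Traditional language-model metrics such as perplexity (PPL) and negative log-likelihood (NLL) are insufficient for instruction-tuned models, because a low loss on held-out text does not show that the model answers questions correctly. C-Eval, a Chinese knowledge benchmark with roughly 14,000 multiple-choice questions across 52 subjects, measures knowledge and reasoning by prompting the model to output an answer letter (A-D) and computing accuracy.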
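
Scoring a multiple-choice benchmark of this kind can be sketched as picking the option letter with the highest model score and comparing it with the gold answer. The logits below are hypothetical stand-ins for the scores a model would assign to the tokens A, B, C, and D; the function names are invented for this example.

```python
def predict_choice(option_scores):
    """Pick the option letter with the highest score (e.g., next-token logit)."""
    return max("ABCD", key=lambda letter: option_scores[letter])


def accuracy(all_scores, gold_answers):
    """Fraction of questions where the top-scoring letter matches the gold answer."""
    correct = sum(predict_choice(s) == g for s, g in zip(all_scores, gold_answers))
    return correct / len(gold_answers)


# Hypothetical per-option logits for three questions.
scores = [
    {"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5},
    {"A": 2.0, "B": 1.9, "C": 0.3, "D": 0.0},
    {"A": -1.0, "B": 0.2, "C": 0.1, "D": 2.5},
]
gold = ["B", "C", "D"]
print(accuracy(scores, gold))  # two of the three predictions match
```

Restricting the comparison to the four option tokens avoids having to parse free-form generations, which is one reason this style of evaluation is cheap to automate.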

Recent works score model generations using GPT-4 as a judge, assigning each answer a score from 0 to 10. However, GPT-4 scores can be biased. Authors who manually reviewed the discrepancies noted that GPT-4 sometimes prefers longer answers even when they do not follow the original instruction.
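
The mechanics of judge-based scoring can be sketched without calling any API: build a grading prompt, then parse a numeric score from the reply. The prompt wording, the `Score:` reply format, and the sample replies below are all assumptions for illustration.

```python
import re
from statistics import mean


def build_judge_prompt(instruction, answer):
    """Hypothetical grading prompt asking for a 0-10 score on the first line."""
    return (
        "Rate the answer to the instruction on a scale from 0 to 10.\n"
        "Reply with 'Score: <number>' on the first line, then a short reason.\n\n"
        f"Instruction: {instruction}\nAnswer: {answer}\n"
    )


def parse_score(judge_reply):
    """Extract the number after 'Score:'; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_reply)
    if not match:
        return None
    score = float(match.group(1))
    return score if 0 <= score <= 10 else None


# Hypothetical judge replies; the last one cannot be parsed and is skipped.
replies = [
    "Score: 8\nThe answer is accurate but verbose.",
    "Score: 6.5\nPartly correct.",
    "I cannot rate this answer.",
]
scores = [s for s in map(parse_score, replies) if s is not None]
print(scores, mean(scores))
```

Keeping the unparsed replies and the answer lengths alongside the scores makes it easier to spot-check the judge by hand, for example to see whether longer answers are being favored.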

Other emerging benchmarks include PandaLM, open‑llm‑leaderboard, and various domain‑specific test sets.


