Why Sub‑word Tokenizers Power Modern LLMs: From Characters to Tokens
This article explains how language models evolved from character‑level to word‑level and finally to sub‑word tokenization, highlighting the efficiency, vocabulary coverage, and practical engineering challenges of sub‑word segmentation in modern AI systems.
1. From Characters to Words: The Evolution Path
Early language models processed text at the character level, forcing neural networks to discover word boundaries and morphological relationships from scratch. This approach required large amounts of data and computation because the model had to learn that sequences like c‑a‑t form the word "cat" and that the trailing "s" indicates a plural.
Switching to whole‑word tokens reduces the learning burden but creates vocabularies of hundreds of thousands of entries, which dramatically increases memory consumption and prevents the model from sharing sub‑components between related words.
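As a rough, purely illustrative calculation (the numbers are assumed, not measured): a 500,000‑entry word vocabulary with 768‑dimensional float32 embeddings needs about 500,000 × 768 × 4 bytes ≈ 1.5 GB for the embedding table alone, whereas a 50,000‑entry sub‑word vocabulary of the same dimensionality needs roughly 150 MB.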
2. Sub‑word Tokenization
Sub‑word tokenizers split words into smaller, reusable units. Typical examples are:
"unhappy" → ["un", "happy"]
"cats" → ["cat", "s"]
"中文分词" → ["中", "文", "分", "词"] (illustrative)
Key advantages:
Shared components enable generalization. Once the model has learned embeddings for "happy" and the prefix "un", it can represent "unhappy" immediately without storing a separate entry.
Robustness to rare or novel words. New terms such as "ChatGPT" are decomposed into known sub‑words (e.g., ["Chat", "G", "P", "T"]), so the model can handle out‑of‑vocabulary words gracefully.
Balanced efficiency and coverage. Popular algorithms (Byte‑Pair Encoding, WordPiece, SentencePiece) typically produce vocabularies of 30,000–50,000 tokens, reducing embedding‑table storage by more than 80% compared with full‑word vocabularies while still covering the vast majority of linguistic phenomena.
Real‑world example: GPT‑3's BPE tokenizer splits the slang term "LOLcats" into ["LOL", "cat", "s"], preserving both the meaning and the grammatical structure.
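If the Hugging Face transformers package is available, you can inspect how a trained BPE vocabulary splits unfamiliar words yourself. The exact pieces depend on the model's learned vocabulary, so the behavior shown in the comments is illustrative rather than guaranteed.

```python
# Sketch: inspect how a trained BPE tokenizer (GPT-2's, here) splits words.
# Requires the `transformers` package and a one-time model download.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["unhappy", "cats", "ChatGPT", "LOLcats"]:
    # The exact sub-words depend on the learned vocabulary; rare or novel
    # words simply fall apart into smaller pieces the model already knows.
    print(word, "->", tokenizer.tokenize(word))
```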
3. Engineering Practices for Tokenizers
A production‑grade tokenizer must solve three challenges:
Intelligent splitting. Consider the word "underground". Depending on corpus statistics, the tokenizer may prefer ["under", "ground"] (a common collocation) over ["un", "der", "ground"] (less frequent pieces).
Chinese word segmentation. Because Chinese lacks whitespace, a good tokenizer should produce meaningful segments such as ["南京", "市", "长江", "大桥"] ("Nanjing / city / Yangtze / bridge") rather than incorrect splits like ["南京", "市长", "江大桥"] (roughly "Nanjing / mayor / Jiang bridge").
Unified encoding scheme. Modern tools (e.g., SentencePiece) follow a two‑step pipeline: (1) convert raw text to a sequence of Unicode characters; (2) iteratively merge the most frequent character sequences into sub‑words, yielding a mixed vocabulary of single characters, morphemes, and whole words. A training sketch appears after the analogy below.
This process is analogous to building a compression archive: high‑frequency patterns are retained, low‑frequency patterns are discarded, achieving a compact representation without sacrificing essential information.
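As a sketch of that pipeline, the snippet below trains a small SentencePiece BPE model on a local text file. The file name "corpus.txt", the vocabulary size, and the model prefix are placeholders chosen for illustration, not values from the article.

```python
# Sketch: training and using a SentencePiece BPE model.
# Requires the `sentencepiece` package; "corpus.txt" is a placeholder
# for any plain-text training file, one sentence per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",       # raw text; SentencePiece works directly on Unicode
    model_prefix="subword",   # writes subword.model and subword.vocab
    vocab_size=8000,          # mixed vocabulary of characters, morphemes, words
    model_type="bpe",         # iteratively merge the most frequent symbol pairs
)

sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("The underground station was unusually crowded.", out_type=str))
```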
When embeddings are combined with sub‑word tokenization, the model operates on token vectors that are neither single characters nor full words but sub‑words, enabling more efficient and expressive language modeling.
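A minimal sketch of that final step, assuming PyTorch and treating the token IDs below as hypothetical outputs of a sub‑word tokenizer:

```python
# Sketch: mapping sub-word token IDs to embedding vectors (PyTorch).
# The vocabulary size, embedding dimension, and token IDs are illustrative.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

# Hypothetical IDs for ["un", "happy"] produced by a sub-word tokenizer.
token_ids = torch.tensor([[1042, 3057]])

vectors = embedding(token_ids)  # shape: (1, 2, 768), one vector per sub-word
print(vectors.shape)
```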
Conclusion
Sub‑word tokenizers represent the latest step in the historical progression from character‑level processing to modern, compact language representations. By decomposing text into reusable pieces, they allow large language models to capture linguistic structure efficiently while remaining adaptable to new vocabulary and multilingual contexts.