What Is a Token? A Deep Dive into Tokenization Algorithms for LLMs
This article defines tokens (词元 in Chinese), explains why large language models require numeric input, and details the three main tokenization strategies (word‑based, character‑based, and subword) along with the subword methods BPE, WordPiece, and Unigram, highlighting the advantages and drawbacks of each.
What is a Token?
A token (词元) is the basic unit of text processed by large language models. Because LLMs operate on numbers rather than raw text, a tokenizer converts text into numeric IDs; each token in the vocabulary is assigned a unique ID that the model uses during inference.
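To get a concrete feel for the text‑to‑ID mapping, here is a minimal example using the Hugging Face transformers library; the "gpt2" checkpoint is just one example, and running it downloads the tokenizer files on first use:

```python
from transformers import AutoTokenizer

# Assumes the Hugging Face `transformers` package is installed;
# "gpt2" is one example checkpoint, any tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("I love LLM")
print(ids)                                   # a list of integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # the subword each ID stands for
```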
Tokenization Algorithms
Tokenizers fall into three families: word‑based, character‑based, and subword‑based. Modern LLMs mainly use subword‑based methods to balance vocabulary size and coverage.
Word‑Based Tokenization
Splits text into words and maps each word to an ID. Example: “I love LLM” → ['I', 'love', 'LLM'].
Advantages: Preserves semantic completeness; easy to understand.
Disadvantages: (1) Out‑of‑vocabulary (OOV) words are replaced by an unknown token, losing their meaning entirely. (2) The vocabulary can become very large; English alone has over 500,000 words by some dictionary counts, requiring a massive token‑ID table.
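A toy sketch of word‑based tokenization makes the OOV problem concrete; the vocabulary and `<unk>` handling below are illustrative, not taken from any real tokenizer:

```python
# Toy word-level tokenizer: anything outside the fixed vocabulary
# collapses to a single unknown token, losing its meaning.
vocab = {"<unk>": 0, "I": 1, "love": 2, "LLM": 3}

def word_tokenize(text: str) -> list[int]:
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(word_tokenize("I love LLM"))         # [1, 2, 3]
print(word_tokenize("I love tokenizers"))  # [1, 2, 0] -- the OOV word is lost
```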
Character‑Based Tokenization
Splits text into individual characters. Example: “text” → ['t', 'e', 'x', 't'].
Advantages: Very small vocabulary; virtually no OOV tokens because any word can be composed from characters.
Disadvantages: (1) Single characters carry limited semantic information. (2) Token count grows dramatically, increasing computational load for LLMs.
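A character‑level tokenizer is trivial to write, which also makes the token‑count blow‑up easy to see in this small sketch:

```python
# Character-level tokenization: tiny vocabulary, but even a short
# sentence produces many tokens.
def char_tokenize(text: str) -> list[str]:
    return list(text)

print(char_tokenize("text"))              # ['t', 'e', 'x', 't']
print(len(char_tokenize("I love LLM")))   # 10 tokens for a 3-word sentence
```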
Subword‑Based Tokenization
Breaks words into subwords, keeping frequent words intact while decomposing rare words into meaningful pieces. Core principle: frequent words stay whole; infrequent words are split.
Example: “tokenization” → ['token', 'ization'], while a common word like “take” remains unsplit.
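The following is a minimal sketch of how a fixed subword vocabulary can segment words by greedy longest‑match; the vocabulary here is made up for illustration, and real tokenizers learn theirs from a corpus:

```python
# Illustrative subword vocabulary: frequent words appear whole,
# rare words must be assembled from smaller pieces.
VOCAB = {"take", "token", "ization", "t", "o", "k", "e", "n"}

def subword_tokenize(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(subword_tokenize("take"))          # ['take'] -- frequent word stays whole
print(subword_tokenize("tokenization"))  # ['token', 'ization'] -- rare word split
```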
Three common implementations differ in vocabulary construction: Byte Pair Encoding (BPE), WordPiece, and Unigram.
Byte Pair Encoding (BPE)
Starts with a vocabulary of all characters. Repeatedly finds the most frequent adjacent character pair in the corpus, merges the pair into a new subword, and adds it to the vocabulary. The process repeats until the vocabulary reaches a predefined size.
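Here is a minimal, unoptimized training sketch of that loop; the toy corpus and target vocabulary size are illustrative:

```python
from collections import Counter

# Minimal BPE training sketch: words start as character sequences, and
# the most frequent adjacent pair is repeatedly merged into a new
# subword until the vocabulary reaches the target size.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
words = [tuple(w) for w in corpus]
vocab = {ch for w in words for ch in w}
target_size = 15

while len(vocab) < target_size:
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    if not pairs:
        break
    (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
    vocab.add(a + b)                     # the merged pair becomes a new subword
    merged = []
    for w in words:                      # apply the merge across the corpus
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    words = merged

print(sorted(vocab, key=len, reverse=True))
```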
WordPiece
Improves on BPE by selecting the pair whose merge most increases the likelihood of the training corpus under a language model (i.e., the merge that best fits linguistic patterns), rather than simply the most frequent pair.
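One common formulation of this scoring divides the pair's frequency by the product of its parts' frequencies, so a merge is favored when the pair co-occurs more often than its parts would suggest. A sketch under that assumption:

```python
from collections import Counter

# WordPiece-style pair scoring (one common formulation): rank pairs by
# freq(pair) / (freq(a) * freq(b)) instead of raw pair frequency.
def best_pair(words: list[tuple[str, ...]]) -> tuple[str, str]:
    pair_freq, unit_freq = Counter(), Counter()
    for w in words:
        unit_freq.update(w)
        pair_freq.update(zip(w, w[1:]))
    return max(pair_freq,
               key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))

words = [tuple("hugging"), tuple("hugs"), tuple("pugs")]
print(best_pair(words))  # the pair with the highest likelihood-based score
```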
Unigram
Begins with an oversized vocabulary containing all candidate subwords and words. It computes a probability for each subword under a unigram language model, then iteratively removes the lowest‑probability subwords until the vocabulary shrinks to the target size.
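A heavily simplified pruning sketch follows. Real Unigram implementations re‑estimate subword probabilities with EM and score each removal by its effect on corpus likelihood; here, as a stand-in, probabilities are just normalized substring counts:

```python
from collections import Counter

corpus = ["hug", "pug", "hugs"]

# Seed an oversized vocabulary with every substring of every word.
counts = Counter(w[i:j] for w in corpus
                 for i in range(len(w))
                 for j in range(i + 1, len(w) + 1))
total = sum(counts.values())
probs = {sub: c / total for sub, c in counts.items()}

target_size = 8
while len(probs) > target_size:
    # Keep single characters so every word remains encodable.
    candidates = [s for s in probs if len(s) > 1]
    if not candidates:
        break
    worst = min(candidates, key=lambda s: probs[s])  # lowest-probability subword
    del probs[worst]

print(sorted(probs, key=probs.get, reverse=True))
```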