What Is a Token? A Deep Dive into Tokenization Algorithms for LLMs

The article defines tokens (词元 in Chinese), explains why large language models require numeric input, and details three main tokenization strategies—word‑based, character‑based, and subword—along with the sub‑methods BPE, WordPiece, and Unigram, highlighting their advantages and drawbacks.


What is a Token?

A token (词元 in Chinese) is the basic unit processed by large language models. Since LLMs operate only on numbers, a tokenizer converts raw text into a sequence of numeric IDs; each token in the vocabulary receives a unique ID that is used during inference.
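
As a concrete illustration, assuming the Hugging Face transformers library is installed and the publicly available GPT‑2 tokenizer can be downloaded, the text‑to‑ID mapping looks roughly like this (a sketch, not a prescription of any particular model's tokenizer):

    # Sketch of text -> token IDs, assuming the Hugging Face `transformers`
    # library and the GPT-2 tokenizer (illustrative choices, not requirements).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "I love LLM"
    ids = tokenizer.encode(text)                    # list of integer token IDs
    tokens = tokenizer.convert_ids_to_tokens(ids)   # the corresponding token strings

    print(tokens)  # the exact split depends on the tokenizer's vocabulary
    print(ids)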

Tokenization Algorithms

Tokenizers fall into three families: word‑based, character‑based, and subword‑based. Modern LLMs mainly use subword‑based methods to balance vocabulary size and coverage.

Word‑Based Tokenization

Splits text into words and maps each word to an ID. Example: “I love LLM” → ['I', 'love', 'LLM'].

Advantages: Preserves semantic completeness; easy to understand.

Disadvantages: (1) Out‑of‑vocabulary (OOV) words are replaced by an unknown token, losing meaning. (2) Vocabulary can become very large; English has >500,000 words, requiring a massive token‑ID table.
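A toy sketch makes both the splitting and the OOV problem concrete. The tiny vocabulary and the <unk> handling below are illustrative assumptions, not part of any real tokenizer:

    # Toy word-based tokenizer: split on whitespace, map each word to an ID.
    vocab = {"<unk>": 0, "I": 1, "love": 2, "LLM": 3}

    def word_tokenize(text):
        words = text.split()                                    # "I love LLM" -> ['I', 'love', 'LLM']
        return [vocab.get(w, vocab["<unk>"]) for w in words]    # OOV words collapse to <unk>

    print(word_tokenize("I love LLM"))         # [1, 2, 3]
    print(word_tokenize("I love tokenizers"))  # [1, 2, 0] -- 'tokenizers' is out of vocabulary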

Character‑Based Tokenization

Splits text into individual characters. Example: “text” → ['t', 'e', 'x', 't'].

Advantages: Very small vocabulary; virtually no OOV tokens because any word can be composed from characters.

Disadvantages: (1) Single characters carry limited semantic information. (2) Token count grows dramatically, increasing computational load for LLMs.
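A minimal sketch of the character‑based approach, restricted to lowercase letters purely for illustration:

    # Toy character-based tokenizer: every character becomes one token.
    import string

    char_vocab = {c: i for i, c in enumerate(string.ascii_lowercase)}

    def char_tokenize(text):
        return [char_vocab[c] for c in text if c in char_vocab]

    print(list("text"))           # ['t', 'e', 'x', 't']
    print(char_tokenize("text"))  # [19, 4, 23, 19] -- many tokens even for short text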

Subword‑Based Tokenization

Breaks words into subwords, keeping frequent words intact while decomposing rare words into meaningful pieces. Core principle: frequent words stay whole; infrequent words are split.

Example: “tokenization” → ['token', 'ization']; a common word like “take” remains unsplit.
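A toy greedy longest‑match split over an assumed subword vocabulary illustrates the idea; real tokenizers build that vocabulary with BPE, WordPiece, or Unigram, described below:

    # Toy subword split: greedy longest match against an assumed vocabulary.
    subword_vocab = {"take", "token", "ization", "ize", "t", "o", "k", "e", "n"}

    def subword_tokenize(word):
        pieces, start = [], 0
        while start < len(word):
            # take the longest vocabulary entry matching at this position
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in subword_vocab:
                    pieces.append(piece)
                    start = end
                    break
            else:
                pieces.append(word[start])  # fall back to a single character
                start += 1
        return pieces

    print(subword_tokenize("tokenization"))  # ['token', 'ization']
    print(subword_tokenize("take"))          # ['take'] -- frequent word kept whole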

Three common implementations differ in vocabulary construction: Byte Pair Encoding (BPE), WordPiece, and Unigram.

Byte Pair Encoding (BPE)

Starts with a vocabulary of all characters. Repeatedly finds the most frequent adjacent character pair in the corpus, merges the pair into a new subword, and adds it to the vocabulary. The process repeats until the vocabulary reaches a predefined size.
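The loop below is a compact sketch of that merge process, not a production implementation; the small word‑frequency corpus and the target vocabulary size are illustrative assumptions:

    # Simplified BPE training: repeatedly merge the most frequent adjacent pair.
    from collections import Counter

    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # word -> count (assumed)
    splits = {w: list(w) for w in corpus}                       # start from single characters
    vocab = set(c for w in corpus for c in w)
    target_size = 20

    def count_pairs():
        pairs = Counter()
        for word, freq in corpus.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    while len(vocab) < target_size:
        pairs = count_pairs()
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        vocab.add(a + b)                      # add the merged subword to the vocabulary
        for word in splits:                   # apply the merge to every word
            symbols, merged, i = splits[word], [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged

    print(sorted(vocab))
    print(splits)   # e.g. 'newest' and 'widest' end up sharing an 'est' subword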

WordPiece

Improves on BPE by selecting the pair whose merge yields the greatest increase in language‑model likelihood (i.e., better fits linguistic patterns) rather than simply the most frequent pair.
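One commonly cited way to express this selection criterion (following the description popularized in the Hugging Face tokenizers course; Google's original implementation is not public) is to normalize the pair frequency by the frequencies of its parts:

    # WordPiece-style pair scoring as commonly described: a pair scores highly
    # only if it occurs together much more often than its parts occur alone.
    def wordpiece_score(pair_freq, first_freq, second_freq):
        # Higher score -> merging this pair yields a larger likelihood gain.
        return pair_freq / (first_freq * second_freq)

    # Example with assumed counts:
    print(wordpiece_score(pair_freq=20, first_freq=100, second_freq=40))   # 0.005
    print(wordpiece_score(pair_freq=30, first_freq=100, second_freq=600))  # 0.0005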

Unigram

Begins with an oversized vocabulary containing all plausible subwords and words. Assigns each subword a probability under a unigram language model, then iteratively removes the subwords that contribute least to the overall likelihood of the corpus until the vocabulary reaches the target size.
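A heavily simplified sketch of the pruning idea follows. Real Unigram training re‑estimates probabilities with EM and scores each candidate by how much its removal would hurt the corpus likelihood; here probabilities are simply frequency‑based and single characters are never pruned, both illustrative assumptions:

    # Simplified Unigram pruning: start from a large candidate vocabulary and
    # repeatedly drop the least probable multi-character subword.
    subword_counts = {
        "token": 50, "tokens": 10, "ization": 30, "iza": 2, "tion": 40,
        "t": 100, "o": 90, "k": 20, "e": 95, "n": 80, "i": 70, "z": 5, "a": 60, "s": 55,
    }
    target_size = 12

    total = sum(subword_counts.values())
    probs = {s: c / total for s, c in subword_counts.items()}

    while len(probs) > target_size:
        # keep single characters so every word can still be tokenized
        candidates = {s: p for s, p in probs.items() if len(s) > 1}
        if not candidates:
            break
        worst = min(candidates, key=candidates.get)   # lowest-probability subword
        del probs[worst]

    print(sorted(probs))   # the surviving vocabulary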
