What Is a Token in Large Language Models?
The article explains that a token is the unit processed by large language models, describes three common tokenizer methods—word‑level, character‑level, and sub‑word level—with English and Chinese examples, discusses their advantages and limitations, and shows how OpenAI’s tokenizer varies across model versions.
Tokens are the basic units that large language models read and generate; a tokenizer converts raw text into a sequence of tokens (and their integer IDs) before the model processes it.
Word Level
English: Word‑level tokenization splits text into words at spaces and, optionally, at punctuation. Example: “Let's try some language processing tasks.” splits on spaces into “Let's, try, some, language, processing, tasks.”; splitting punctuation off as well gives “Let, ', s, try, some, language, processing, tasks, .”
Chinese: Because Chinese text has no spaces between words, word‑level tokenization requires a word segmenter. Example: the sentence “我们开始学习自然语言处理任务。” may be tokenized as “我们,开始,学习,自然语言处理,任务。” The approach is intuitive, but it needs a large vocabulary and struggles with out‑of‑vocabulary (OOV) words; dictionary‑based tools such as jieba implement it for Chinese (see the sketch below).
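As a concrete illustration, here is a minimal sketch in Python. The English split uses a simple regular expression as an approximation of space-and-punctuation splitting, and the Chinese split uses the jieba library cited in the references; exact token boundaries depend on the tool and its dictionary, so the outputs are illustrative.

```python
import re

import jieba  # dictionary-based Chinese word segmenter (pip install jieba)

# English: split into words, keeping contractions together, and split punctuation off.
english = "Let's try some language processing tasks."
english_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", english)
print(english_tokens)  # ["Let's", 'try', 'some', 'language', 'processing', 'tasks', '.']

# Chinese: no spaces between words, so a segmenter decides the word boundaries.
chinese = "我们开始学习自然语言处理任务。"
print(jieba.lcut(chinese))  # e.g. ['我们', '开始', '学习', '自然语言', '处理', '任务', '。']
```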
Character Level
English: Character‑level tokenization splits text into individual characters. Its advantages are a much smaller vocabulary and virtually no OOV problem; the trade‑off is that token sequences become much longer. Example: “Let's tackle NLP tasks.” → tokens “L, e, t, ', s, t, a, c, k, l, e, N, L, P, t, a, s, k, s, .”
Chinese: Each Chinese character becomes a token. Example: “我们一起学习自然语言处理。” → tokens “我,们,一,起,学,习,自,然,语,言,处,理,。” This method is common in Chinese because each character carries meaning.
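Character-level splitting needs no special tooling; in Python, a string already iterates one character at a time. A minimal sketch (dropping whitespace is my own choice here, since the article does not say how it is handled):

```python
def char_tokenize(text: str) -> list[str]:
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("Let's tackle NLP tasks."))
# ['L', 'e', 't', "'", 's', 't', 'a', 'c', 'k', 'l', 'e', 'N', 'L', 'P', 't', 'a', 's', 'k', 's', '.']

print(char_tokenize("我们一起学习自然语言处理。"))
# ['我', '们', '一', '起', '学', '习', '自', '然', '语', '言', '处', '理', '。']
```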
Subword Level
English: Sub‑word tokenization lies between the character and word levels, breaking rare words into meaningful pieces while keeping common words intact. Example: “Let's try some sub‑word tokenization.” → tokens “Let, ’s, try, some, sub, -word, token, ization, .” Most sub‑word tokenizers attach a marker to distinguish pieces that start a word from pieces that continue one (for example, “##” in WordPiece, or a leading space in GPT‑style BPE), which lets them handle unseen words and keep the vocabulary small while preserving semantic structure.
Chinese: Sub‑word tokenization also splits uncommon Chinese words into smaller parts while keeping frequent words whole. Example sentence “我们学习子词级分词方法。” may be tokenized as “我们,学习,子,词级,分,词,方法,。” This retains semantic information and flexibly handles rare terms.
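To see real sub-word pieces, one option (not mentioned in the article) is a pretrained WordPiece tokenizer from the Hugging Face transformers library, where the “##” prefix marks a piece that continues the previous word. The split depends entirely on the model's trained vocabulary, so the output shown is illustrative.

```python
from transformers import AutoTokenizer  # pip install transformers

# WordPiece tokenizer trained for BERT; "##" marks a continuation of the previous word.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Let's try some subword tokenization."))
# e.g. ['let', "'", 's', 'try', 'some', 'sub', '##word', 'token', '##ization', '.']
```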
Testing Tokens
Pasting text into the OpenAI tokenizer (https://platform.openai.com/tokenizer) shows that different GPT models use different encodings; newer models generally produce fewer tokens for the same text and segment Chinese more sensibly, whereas older encodings often break a single Chinese character into several byte‑level tokens.
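The same comparison can be run locally with OpenAI's open-source tiktoken library (my addition; the article only mentions the web tool). Roughly, older GPT-3-era models use the r50k_base encoding while GPT-3.5/GPT-4 use cl100k_base, so comparing the two shows how token counts shrink, especially for Chinese.

```python
import tiktoken  # pip install tiktoken

samples = ["Let's try some subword tokenization.", "我们开始学习自然语言处理任务。"]

# Compare an older and a newer encoding; newer encodings usually need fewer tokens.
for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    for text in samples:
        ids = enc.encode(text)
        # Some tokens are partial UTF-8 byte sequences and may print as "�".
        pieces = [enc.decode([i]) for i in ids]
        print(f"{name}: {len(ids):2d} tokens -> {pieces}")

# Look up the encoding a specific model uses:
print(tiktoken.encoding_for_model("gpt-4").name)  # cl100k_base
```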
References
jieba tokenizer: https://github.com/fxsjy/jieba
