What Is a Token in Large Language Models?

The article explains that a token is the unit processed by large language models, describes three common tokenizer methods—word‑level, character‑level, and sub‑word level—with English and Chinese examples, discusses their advantages and limitations, and shows how OpenAI’s tokenizer varies across model versions.

Infra Learning Club

Tokens are the basic input/output units for large language models, produced by a tokenizer that splits text into tokens.

Word Level

English: Word‑level tokenization splits text into words by spaces, optionally splitting punctuation off as well. Example: “Let's try some language processing tasks.” → tokens “Let's, try, some, language, processing, tasks.” With punctuation split off: “Let, ', s, try, some, language, processing, tasks, .”

Chinese: Because Chinese lacks spaces, word‑level tokenization is harder. Example sentence “我们开始学习自然语言处理任务。” may be tokenized as “我们,开始,学习,自然语言处理,任务。” This approach is intuitive but struggles with out‑of‑vocabulary words; tools like jieba use it.
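The English case above can be sketched with a simple regular expression; the pattern below (which keeps contractions like “Let's” intact) is one illustrative choice, not the only one. For Chinese, whitespace splitting does not apply, which is why segmenters such as jieba exist.

```python
import re

def word_tokenize(text):
    """Word-level tokenization: split on whitespace, keeping
    punctuation as separate tokens."""
    # \w+'\w+ keeps contractions like "Let's" together;
    # \w+ matches plain words; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+'\w+|\w+|[^\w\s]", text)

print(word_tokenize("Let's try some language processing tasks."))
# ["Let's", 'try', 'some', 'language', 'processing', 'tasks', '.']

# For Chinese, a segmenter such as jieba (see References) is used
# instead, e.g. jieba.lcut("我们开始学习自然语言处理任务。")
```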

Character Level

English: Character‑level tokenization splits every character. Advantages: much smaller vocabulary and virtually no OOV. Example: “Let's tackle NLP tasks.” → tokens “L, e, t, ', s, t, a, c, k, l, e, N, L, P, t, a, s, k, s, .”

Chinese: Each Chinese character becomes a token. Example: “我们一起学习自然语言处理。” → tokens “我,们,一,起,学,习,自,然,语,言,处,理,。” This method is common in Chinese because each character carries meaning.
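Character-level tokenization is trivial to implement, since a string is already a sequence of characters; the sketch below drops whitespace, which is one possible convention.

```python
def char_tokenize(text):
    """Character-level tokenization: every character (including each
    Chinese character) becomes its own token; whitespace is dropped."""
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("NLP 任务"))
# ['N', 'L', 'P', '任', '务']
```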

Subword Level

English: Sub‑word tokenization lies between character and word levels, breaking rare words into meaningful sub‑words while keeping common words intact. Example: “Let's try some sub‑word tokenization.” → tokens “Let, 's, try, some, sub, -word, token, ization, .” A continuation marker (e.g., “##” in WordPiece, or a leading‑space symbol in GPT‑style BPE) indicates whether a token starts a new word or continues the previous one, helping handle unseen words and reducing vocabulary size while preserving semantic structure.

Chinese: Sub‑word tokenization also splits uncommon Chinese words into smaller parts while keeping frequent words whole. Example sentence “我们学习子词级分词方法。” may be tokenized as “我们,学习,子,词级,分,词,方法,。” This retains semantic information and flexibly handles rare terms.
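A minimal sketch of the sub-word idea, using WordPiece-style greedy longest match. The vocabulary here is hypothetical; real vocabularies are learned from a corpus (for example via BPE merges) and contain tens of thousands of entries.

```python
# Hypothetical toy vocabulary; "##" marks a continuation piece.
VOCAB = {"let", "'s", "try", "token", "##ization", "##s"}

def subword_tokenize(word, vocab=VOCAB):
    """Split one word into sub-words by repeatedly taking the longest
    vocabulary match from the current position (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end].lower()
            if start > 0:                # not at word start:
                piece = "##" + piece     # mark as a continuation
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matched
    return pieces

print(subword_tokenize("tokenization"))
# ['token', '##ization']
```

A frequent word like “token” stays whole, while a rarer form like “tokenization” is split into pieces the model has seen before; an entirely unmatched word falls back to an unknown token.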

Testing Tokens

Using the OpenAI tokenizer (e.g., via https://platform.openai.com/tokenizer) shows that different GPT models use different tokenization schemes; newer encodings produce fewer tokens for the same text and segment Chinese more sensibly.

References

jieba tokenizer: https://github.com/fxsjy/jieba

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, tokenization, token, NLP, jieba, subword, character-level, word-level
Written by

Infra Learning Club

Infra Learning Club shares study notes, cutting-edge technology, and career discussions.
