AI Comic Episode 3: What Exactly Is a Token?
This episode explains that a token is the smallest text chunk an LLM processes—ranging from characters to subwords—covers why subword tokenization avoids vocabulary explosion, compares token counts across languages, describes the computational cost of sequential generation, and introduces visual tokens for multimodal models.
This article is the third episode of the "AI Small Classroom" comic series, focusing on the concept of tokens in large language models.
ROUND 1
It starts with a playful question: "Can a token be eaten?" and promises a quick understanding of AI.
ROUND 2
A token is defined as the smallest unit a large language model uses to process text. It can be a character, a word, a sub‑word, or even a punctuation mark, and the model "eats" these tokens to understand and generate language.
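To make the "eating" concrete, here is a minimal sketch of turning a sentence into tokens and back. It assumes the tiktoken package is installed and uses the cl100k_base encoding as one example; the exact splits vary by tokenizer.

```python
# Minimal sketch of "eating" text as tokens. Assumes the tiktoken package
# is installed; cl100k_base is just one example encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Can a token be eaten?"
ids = enc.encode(text)                    # text -> integer token ids
pieces = [enc.decode([i]) for i in ids]   # each id decodes back to a small text chunk

print(ids)      # a short list of integers (exact values depend on the encoding)
print(pieces)   # chunks such as words, sub-words, and the trailing '?'
```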
ROUND 3
Using whole words as tokens would cause the vocabulary to explode, make rare words unmanageable, waste compute, and fail to handle multilingualism and morphological variations. Sub‑word tokens are likened to Lego blocks—flexible, material‑efficient, and capable of constructing any word.
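A toy illustration of the Lego idea (the piece inventory below is made up purely for this example): a handful of reusable sub‑word pieces already composes dozens of surface forms, which is why a modest sub‑word vocabulary scales where a whole‑word vocabulary explodes.

```python
# Toy "Lego" demo: a small, made-up inventory of sub-word pieces composes
# many surface forms without storing each full word in the vocabulary.
from itertools import product

prefixes = ["un", "re", "over"]
stems = ["think", "load", "build"]
suffixes = ["ing", "able", "ed", "er"]

words = {p + s + x for p, s, x in product(prefixes, stems, suffixes)}

print(len(words))                         # 36 distinct surface forms...
print(len(prefixes + stems + suffixes))   # ...from an inventory of only 10 pieces
```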
ROUND 4
Modern models do not treat a long word such as "tokenization" as a single token; they split it into two or more pieces, e.g., "token" + "iza" + "tion", or more commonly "token" + "ization". As a result, a vocabulary of only a few tens of thousands of sub‑words can compose millions of words, effectively without limit.
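The split can be checked directly with a real tokenizer; tiktoken's cl100k_base encoding is used here as one example, and other tokenizers may cut the word differently.

```python
# How one real BPE tokenizer happens to split "tokenization"; other
# tokenizers may choose different pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("tokenization")
print([enc.decode([i]) for i in ids])   # e.g. ['token', 'ization']
```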
ROUND 5
When a prompt is fed in, the entire input is processed in one large matrix multiplication, which is efficient. During generation, however, each new token must attend over all previous tokens, so the attention work is repeated step after step and the memory and compute costs grow quickly for long sequences.
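A back-of-the-envelope sketch of that cost, counting attention score pairs only and ignoring optimizations such as KV caching, shows why long outputs are the expensive part:

```python
# Rough count of attention score pairs. Prefill handles the whole prompt in
# one batched matrix multiply; naive decoding re-attends over the full
# prefix for every new token (KV caching and masking details ignored).
def prefill_pairs(n_prompt: int) -> int:
    return n_prompt * n_prompt                      # one n x n attention matrix

def decode_pairs(n_prompt: int, n_new: int) -> int:
    return sum(n_prompt + t for t in range(n_new))  # step t sees n_prompt + t tokens

print(prefill_pairs(1000))        # 1,000,000 pairs, computed in one pass
print(decode_pairs(1000, 500))    # 624,750 pairs, computed one step at a time
```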
ROUND 6
Output generation is serial: each token depends on all previously generated tokens, requiring a fresh attention computation at every step. Example: the input "Who are you?" is answered token by token, "I" → "I am" → "I am Shi" → "I am Shi Zhen" → ..., with the whole prefix re-attended at each step.
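A toy loop (with a stand-in model, not a real LLM) mirrors that buildup: every step feeds the entire prefix back in and appends exactly one token.

```python
# Toy autoregressive loop. `fake_model` is a canned stand-in for an LLM:
# it looks at the whole prefix and returns the next token of a fixed reply.
def fake_model(prefix: list[str]) -> str:
    reply = ["I", "am", "Shi", "Zhen", "."]
    step = len(prefix) - 1                     # how many tokens have been emitted so far
    return reply[step] if step < len(reply) else "<eos>"

tokens = ["Who are you?"]                      # the prompt, as a single chunk
while (nxt := fake_model(tokens)) != "<eos>":
    tokens.append(nxt)                         # each step depends on ALL prior tokens
    print(" ".join(tokens[1:]))                # I / I am / I am Shi / I am Shi Zhen / ...
```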
ROUND 7
In Chinese, one character roughly corresponds to one token, while an English token typically covers 4–5 letters, often an entire short word. Since common Chinese words run to two characters, expressing the same meaning in Chinese often requires 50%–100% more tokens than in English.
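The gap is easy to measure. The sketch below uses tiktoken's cl100k_base encoding on a pair of roughly equivalent sentences; the exact counts depend on the tokenizer and the text chosen.

```python
# Compare token counts for roughly the same meaning in English and Chinese.
# Assumes tiktoken is installed; numbers vary by tokenizer and sentence.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

en = "A token is the smallest unit a large language model processes."
zh = "Token 是大语言模型处理文本的最小单位。"

print(len(enc.encode(en)), "tokens for the English sentence")
print(len(enc.encode(zh)), "tokens for the Chinese sentence")
```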
ROUND 8
The token count for the same text depends on the tokenizer. The dominant English‑centric tokenizers (GPT, LLaMA, Grok) handle Chinese poorly, whereas the Chinese‑friendly tokenizers used by many domestic Chinese models can cut token counts by 30%–50%.
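The same experiment with two different tokenizers makes the point. The model names below are illustrative stand-ins (one English-centric, one Chinese-friendly), not the exact tokenizers the comic names, and the Hugging Face transformers package is assumed.

```python
# One Chinese sentence, two tokenizers: an English-centric byte-level BPE
# (gpt2) versus a tokenizer with a Chinese-friendly vocabulary (Qwen).
# Model names are illustrative choices; assumes the transformers package.
from transformers import AutoTokenizer

text = "大语言模型用分词器把文本切成最小单位。"

english_centric = AutoTokenizer.from_pretrained("gpt2")
chinese_friendly = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

print("gpt2 tokens:", len(english_centric.encode(text)))
print("Qwen tokens:", len(chinese_friendly.encode(text)))   # typically far fewer
```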
ROUND 9
Multimodal models first split an image into patches/tiles, encode them into "visual tokens" (anywhere from dozens to thousands), and concatenate them with the text tokens before feeding everything to the Transformer. DeepSeek-OCR compresses a document image into 100–800 visual tokens, more than ten times fewer than GPT‑4V, while accurately outputting Markdown with tables, formulas, and layout, making it a leading open‑source document‑understanding tool in 2025.
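Where the visual-token counts come from can be sketched with simple patch arithmetic; the 16-pixel patch size below is a common ViT-style choice used purely for illustration, not DeepSeek-OCR's actual configuration.

```python
# Sketch of counting visual tokens: split an image into fixed-size patches,
# one token per patch. Patch size 16 is a common ViT-style choice used here
# only for illustration.
def num_visual_tokens(height: int, width: int, patch: int = 16) -> int:
    return (height // patch) * (width // patch)

print(num_visual_tokens(224, 224))     # 14 * 14 = 196 tokens for a small image
print(num_visual_tokens(1024, 1024))   # 64 * 64 = 4096 tokens before any compression
```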
ROUND 10
In summary, a token is the "electron" of the AI world; the entire AI ecosystem is built by assembling these token “Lego bricks”.
ShiZhen AI
Tech blogger with over 10 years of experience at leading tech firms; AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure, and runs an AI leisure community. 🛰 szzdzhp001
