What Exactly Is a Token in LLMs? A First‑Principles Explanation
The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.
1. Why Models Can’t Read Raw Text Directly
Humans interpret a sentence like “北京今天天气不错” as semantic concepts (place, time, state), but for a computer the raw input is just a sequence of Unicode characters or bytes. Neural networks operate on numbers, so the model must first convert the text into stable, discrete units.
From a first‑principles view, a large language model performs three steps:
Split the text into stable discrete units.
Assign each unit an integer ID.
Map the IDs to vectors that are fed into the neural network.
These discrete units are called tokens.
A token is not a “word” in human language; it is the smallest discrete text unit the model uses for computation.
2. What a Token Actually Is
A token corresponds to an entry in the model’s vocabulary. An entry can be a whole word, a sub‑word fragment, a single Chinese character, a punctuation mark, a space‑prefixed token, a code symbol, or even half of an uncommon word.
A token is the fragment that the tokenizer extracts from the original text according to a fixed vocabulary and splitting rules.
3. Why Not Split by Characters or Whole Words
3.1 Character‑level Splitting
Splitting every character creates very long sequences, inflating the Transformer’s attention cost, which grows quadratically with sequence length.
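A rough back-of-the-envelope comparison makes the quadratic effect concrete (the token counts below are illustrative, not from any real tokenizer):

```python
# Self-attention compares every token with every other token,
# so the pairwise-interaction count grows with the square of
# the sequence length.
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len

# "北京今天天气不错" is 8 characters: character-level splitting yields
# 8 tokens, while a sub-word tokenizer might yield roughly 4.
char_level = attention_pairs(8)     # 64 pairwise interactions
subword_level = attention_pairs(4)  # 16 pairwise interactions

print(char_level, subword_level)  # halving the length quarters the cost
```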
3.2 Word‑level Splitting
Using whole words would require an enormous vocabulary to cover all inflections, new words, misspellings, URLs, file paths, and variable names.
3.3 Sub‑word Splitting as a Compromise
Modern tokenizers use sub‑words, which keep common words as single tokens for efficiency while breaking rare words into multiple pieces to avoid out‑of‑vocabulary problems. This keeps the vocabulary size manageable and works for natural language, code, punctuation, paths, numbers, etc.
Use “sub‑words” instead of characters or whole words as the primary unit.
Common words stay as one token, improving efficiency.
Rare words are split, avoiding OOV.
Vocabulary size stays controllable.
Works for mixed text such as code, URLs, JSON, logs.
4. What a Tokenizer Does
A tokenizer is a “text splitter and encoder”. It performs two core actions:
Split the text into tokens.
Map each token to an integer ID.
Example:
"Hello, world!" → ["Hello", ",", " world", "!"] → [15496, 11, 995, 0]

The model never sees the raw string; it sees the sequence of IDs.
Tokenization is the discretization interface before text enters the neural network.
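The split-and-encode step can be sketched with a hypothetical toy vocabulary and greedy longest-match splitting (real tokenizers use learned merge rules and far larger vocabularies; the IDs here are made up):

```python
# Hypothetical toy vocabulary. Note the space-prefixed " world":
# many real vocabularies encode the leading space into the token.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, " ": 4,
         "w": 5, "o": 6, "r": 7, "l": 8, "d": 9}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match split, then map each piece to its ID."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return ids

print(tokenize("Hello, world!"))  # [0, 1, 2, 3]
```

Because " world" is a single space-prefixed entry, the same word at sentence start ("world") and mid-sentence (" world") would map to different tokens, which is exactly the behavior described in section 5.1.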
5. How Tokenizers Split Text
Common algorithms include BPE, WordPiece, Unigram, and byte‑level BPE. All follow the principle of learning frequent fragments from large corpora and keeping them as larger tokens while breaking rare fragments into smaller pieces.
Learn which fragments appear most often and keep high‑frequency fragments as larger tokens.
Illustrative example: if the corpus frequently contains "the", "ing", "tion", "http", ".com", "def", "return", the tokenizer will prioritize these as whole tokens, while low‑frequency pieces are composed from smaller units.
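The learning step can be sketched as repeatedly merging the most frequent adjacent pair, which is the core idea behind BPE (a simplified sketch on a three-word toy corpus, not a production trainer):

```python
from collections import Counter

def most_frequent_pair(corpus: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all sequences."""
    pairs = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

# Start from single characters; frequent fragments grow into larger tokens.
corpus = [list("hugging"), list("hug"), list("hugs")]
for _ in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)

print(corpus[1])  # ['hug'] — the frequent word became one token
```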
5.1 Why Spaces Are Often Tokens
Many tokenizers encode leading spaces as part of the token, allowing the model to distinguish a word at the start of a sentence from the same word in the middle.
The model can more naturally differentiate “word‑initial” and “word‑internal” patterns.
6. Why Different Models Yield Different Token Counts
Token count is a product of both the text and the tokenizer. Different models use different vocabularies, byte‑level options, language biases, code optimizations, and Chinese‑specific merging rules, leading to varying token counts for the same sentence.
Token count = text + tokenizer.
Rough heuristics like “1 token ≈ 4 characters” are only very coarse estimates.
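Two hypothetical vocabularies tokenizing the same string show why counts diverge (the vocabularies and counts below are invented for illustration; real tokenizers differ in far subtler ways):

```python
def count_tokens(text: str, vocab: set[str]) -> int:
    """Greedy longest-match token count under a given vocabulary."""
    count, i = 0, 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                count += 1
                i = j
                break
        else:
            count += 1  # unknown character falls back to one token
            i += 1
    return count

# Hypothetical vocabularies: one with large merged fragments,
# one that only knows single characters.
rich = {"token", "ization", " matters"}
poor = set("tokenizam rs")

text = "tokenization matters"
print(count_tokens(text, rich))  # 3
print(count_tokens(text, poor))  # 20
```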
7. Why Token Counts Vary Across Languages and Code
English has stable high‑frequency words and prefixes, so tokenizers can keep many words as single tokens. Chinese lacks spaces and has ambiguous word boundaries, so tokenizers rely more on single characters, bi‑grams, and common phrases. Code contains long identifiers, many symbols, and meaningful whitespace, causing many tokens.
8. What Happens After Tokens Enter the Model
Token IDs are first mapped to high‑dimensional embedding vectors (learned during training). Then positional information is added because the same set of tokens in a different order conveys different meaning.
Embedding = the model’s numeric representation of a token.
With token embeddings and positional encodings, the sequence is fed into the Transformer for attention‑based computation.
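The lookup-and-add step can be sketched with tiny made-up dimensions (real models use thousands of dimensions, trained embedding tables, and often learned or rotary position schemes; sinusoidal encoding is shown here as in the original Transformer):

```python
import math

VOCAB_SIZE, DIM = 16, 4

# Embedding table: one vector per token ID. Placeholder values here;
# in a real model these weights are learned during training.
embedding_table = [[(i * DIM + d) * 0.01 for d in range(DIM)]
                   for i in range(VOCAB_SIZE)]

def positional_encoding(pos: int) -> list[float]:
    """Sinusoidal position vector: alternating sin/cos frequencies."""
    return [math.sin(pos / 10000 ** (d / DIM)) if d % 2 == 0
            else math.cos(pos / 10000 ** ((d - 1) / DIM))
            for d in range(DIM)]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up each ID's vector and add its position encoding."""
    return [[e + p for e, p in zip(embedding_table[tid],
                                   positional_encoding(pos))]
            for pos, tid in enumerate(token_ids)]

vectors = embed([3, 7, 2])
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 dimensions each
```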
9. LLMs Generate Token‑by‑Token, Not Sentence‑by‑Sentence
The training objective is to predict the next most likely token given all previous tokens. Generation proceeds in a loop: predict next token → append it to context → repeat until an end condition.
The model outputs tokens one at a time via autoregressive generation.
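The loop can be sketched with a stub standing in for the network (the predictor below is a made-up toy that echoes a fixed reply; a real model scores the entire vocabulary on every step):

```python
EOS = -1  # hypothetical end-of-sequence token ID

def predict_next(context: list[int]) -> int:
    """Stub for the model: returns a fixed reply, then stops.
    A real LLM runs a full forward pass over the context here."""
    reply = [10, 11, 12]
    generated = len(context) - 2  # this toy assumes a 2-token prompt
    return reply[generated] if generated < len(reply) else EOS

def generate(prompt: list[int], max_tokens: int = 16) -> list[int]:
    context = list(prompt)
    for _ in range(max_tokens):
        token = predict_next(context)  # read whole context, predict one token
        if token == EOS:
            break
        context.append(token)          # append and repeat
    return context[len(prompt):]

print(generate([1, 2]))  # [10, 11, 12]
```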
10. Tokens as the Model’s Step Size
Each generation step consists of reading the entire context, predicting the next token, appending it, and moving to the next step. More output tokens mean higher latency and cost.
Short answers are faster and cheaper.
Long explanations are slower and more expensive.
11. Why Context Windows Are Measured in Tokens
The context window is the maximum number of tokens the model can attend to in a single forward pass. It includes system prompts, conversation history, tool descriptions, tool results, user input, and previously generated output.
The context window is the token capacity the model can “see” at once.
Longer contexts increase attention computation, latency, memory usage, cost, and noise.
12. Direct Relationship Between Tokens and Cost
Commercial LLM APIs charge per token because token count closely reflects actual compute, memory, and network usage. Input tokens increase the amount of context the model must read; output tokens increase the number of generation steps.
Input tokens → read cost.
Output tokens → generation cost.
From an infrastructure perspective, a token is analogous to a CPU time slice, a byte of storage, or a unit of network traffic.
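This pricing model is simple to express directly (the per-million-token prices below are placeholders, not any vendor's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one API call with per-million-token prices.
    Output tokens usually cost more: each one is a full generation step."""
    return (input_tokens * price_in_per_m +
            output_tokens * price_out_per_m) / 1_000_000

# Hypothetical prices: $1 per 1M input tokens, $4 per 1M output tokens.
cost = request_cost(input_tokens=8_000, output_tokens=1_000,
                    price_in_per_m=1.0, price_out_per_m=4.0)
print(f"${cost:.4f}")  # $0.0120
```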
13. Why Same‑Length Texts Can Have Different Token Costs
Content such as stack traces, UUIDs, base64 strings, minified JSON, file paths, SQL, or long variable names compress poorly, inflating token counts.
Tool schemas, JSON schemas, system prompts, and historical tool results also consume tokens.
Repeating full conversation history linearly accumulates tokens, so production systems often truncate, summarize, compress, retrieve‑only‑relevant parts, or cache.
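The truncation strategy can be sketched as dropping the oldest turns until the history fits the budget (per-message token counts are assumed precomputed by the tokenizer; a real system would also pin the system prompt and consider summarization instead of plain dropping):

```python
def truncate_history(messages: list[tuple[str, int]],
                     budget: int) -> list[tuple[str, int]]:
    """Keep the newest messages whose token counts fit within budget.
    Each message is (text, token_count)."""
    kept, used = [], 0
    for text, n_tokens in reversed(messages):  # walk newest-first
        if used + n_tokens > budget:
            break
        kept.append((text, n_tokens))
        used += n_tokens
    return list(reversed(kept))                # restore chronological order

history = [("turn 1", 400), ("turn 2", 300), ("turn 3", 200), ("turn 4", 100)]
print(truncate_history(history, budget=550))  # keeps turn 3 and turn 4
```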
14. Tokens vs. Model “Intelligence”
Tokens are the basic unit of input, output, and computation. Model intelligence depends on parameters, training data, architecture, objectives, and alignment, not on token count alone.
A larger context window only lets the model handle more tokens in a single computation; it does not make the model smarter.
15. Practical Token Management for LLM Systems
15.1 Treat Tokens as a Budget
Each request consumes budgets: input budget, output budget, history budget, and tool budget.
15.2 Prioritize Reducing High‑Noise Content
Overly long system prompts.
Huge tool schemas.
Useless history messages.
Raw logs and large JSON blobs.
15.3 Distinguish “Information to Keep” from “Raw Text to Keep”
Structured summaries.
Key fields.
Conclusions.
Relevant references.
From a token perspective, summarization and compression are major levers for system quality.
15.4 Choose Models and Strategies per Content Type
Long‑form analysis needs careful context budgeting.
Code agents must watch token cost of paths, diffs, and schemas.
Multi‑turn dialogue should prioritize memory compression.
Mature LLM systems always have token‑management awareness at the infrastructure layer.
16. Common Misconceptions
Misconception 1: Token = Word
Incorrect. A token can be a word, sub‑word, punctuation, space prefix, or byte fragment.
Misconception 2: Token = Character
Incorrect. Characters are raw text; tokens are the computational units learned by the tokenizer.
Misconception 3: All Models Count Tokens Identically
Incorrect. Different models use different tokenizers.
Misconception 4: Bigger Context Window Means Smarter Model
Incorrect. It only indicates larger capacity, not reasoning quality.
Misconception 5: Only User Input Consumes Tokens
Incorrect. System prompts, history, tool schemas, tool results, and model outputs all consume tokens.
17. Putting It All Together
The token lifecycle can be visualized as:
flowchart TD
A[Human Text] --> B[Tokenizer]
B --> C[Token Sequence]
C --> D[Token ID]
D --> E[Embedding]
E --> F[Transformer reads full context]
F --> G[Predict next token probability]
G --> H[Select a token]
H --> I[Append back to context]
I --> J[Repeat until stop]

Thus, tokens serve three roles:
Discrete representation of text.
Fundamental step size for model computation and generation.
Core metric for context, cost, and latency.
18. Final Takeaway
A token is not a human‑semantic “word”; it is a neural‑network‑oriented “text atom”.
Understanding tokens reveals the true mechanics of LLMs: they process token sequences, generate output token‑by‑token, and manage context, cost, and latency through token budgeting.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Full-Stack Cultivation Path
Focused on sharing practical tech content about TypeScript, Vue 3, front-end architecture, and source code analysis.
