What Exactly Is a Token in LLMs? A First‑Principles Explanation
The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.
1. Why Models Can’t Read Raw Text Directly
Humans interpret a sentence like “北京今天天气不错” as semantic concepts (place, time, state), but for a computer the raw input is just a sequence of Unicode characters or bytes. Neural networks operate on numbers, so the model must first convert the text into stable, discrete units.
From a first‑principles view, a large language model performs three steps:
Split the text into stable discrete units.
Assign each unit an integer ID.
Map the IDs to vectors that are fed into the neural network.
These discrete units are called tokens.
A token is not a “word” in human language; it is the smallest discrete text unit the model uses for computation.
2. What a Token Actually Is
A token corresponds to an entry in the model’s vocabulary. An entry can be a whole word, a sub‑word fragment, a single Chinese character, a punctuation mark, a space‑prefixed token, a code symbol, or even half of an uncommon word.
A token is the fragment that the tokenizer extracts from the original text according to a fixed vocabulary and splitting rules.
3. Why Not Split by Characters or Whole Words
3.1 Character‑level Splitting
Splitting every character creates very long sequences, inflating the Transformer’s attention cost, which grows quadratically with sequence length.
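A rough back-of-the-envelope comparison makes the quadratic effect concrete (the token counts below are illustrative, not from any real tokenizer):

```python
# Self-attention compares every token with every other token,
# so the pairwise-interaction count grows with the square of
# the sequence length.
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len

# "北京今天天气不错" is 8 characters: character-level splitting yields
# 8 tokens, while a sub-word tokenizer might yield roughly 4.
char_level = attention_pairs(8)     # 64 pairwise interactions
subword_level = attention_pairs(4)  # 16 pairwise interactions

print(char_level, subword_level)  # halving the length quarters the cost
```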
3.2 Word‑level Splitting
Using whole words would require an enormous vocabulary to cover all inflections, new words, misspellings, URLs, file paths, and variable names.
3.3 Sub‑word Splitting as a Compromise
Modern tokenizers use sub‑words, which keep common words as single tokens for efficiency while breaking rare words into multiple pieces to avoid out‑of‑vocabulary problems. This keeps the vocabulary size manageable and works for natural language, code, punctuation, paths, numbers, etc.
Use “sub‑words” instead of characters or whole words as the primary unit.
Common words stay as one token, improving efficiency.
Rare words are split, avoiding OOV.
Vocabulary size stays controllable.
Works for mixed text such as code, URLs, JSON, logs.
4. What a Tokenizer Does
A tokenizer is a “text splitter and encoder”. It performs two core actions:
Split the text into tokens.
Map each token to an integer ID.
Example:
"Hello, world!" → ["Hello", ",", " world", "!"] → [15496, 11, 995, 0]

The model never sees the raw string; it sees the sequence of IDs.
Tokenization is the discretization interface before text enters the neural network.
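The split-and-encode step can be sketched with a hypothetical toy vocabulary and greedy longest-match splitting (real tokenizers use learned merge rules and far larger vocabularies; the IDs here are made up):

```python
# Hypothetical toy vocabulary. Note the space-prefixed " world":
# many real vocabularies encode the leading space into the token.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, " ": 4,
         "w": 5, "o": 6, "r": 7, "l": 8, "d": 9}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match split, then map each piece to its ID."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return ids

print(tokenize("Hello, world!"))  # [0, 1, 2, 3]
```

Because " world" is a single space-prefixed entry, the same word at sentence start ("world") and mid-sentence (" world") would map to different tokens, which is exactly the behavior described in section 5.1.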
5. How Tokenizers Split Text
Common algorithms include BPE, WordPiece, Unigram, and byte‑level BPE. All follow the principle of learning frequent fragments from large corpora and keeping them as larger tokens while breaking rare fragments into smaller pieces.
Learn which fragments appear most often and keep high‑frequency fragments as larger tokens.
Illustrative example: if the corpus frequently contains "the", "ing", "tion", "http", ".com", "def", "return", the tokenizer will prioritize these as whole tokens, while low‑frequency pieces are composed from smaller units.
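The learning step can be sketched as repeatedly merging the most frequent adjacent pair, which is the core idea behind BPE (a simplified sketch on a three-word toy corpus, not a production trainer):

```python
from collections import Counter

def most_frequent_pair(corpus: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all sequences."""
    pairs = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

# Start from single characters; frequent fragments grow into larger tokens.
corpus = [list("hugging"), list("hug"), list("hugs")]
for _ in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)

print(corpus[1])  # ['hug'] — the frequent word became one token
```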
5.1 Why Spaces Are Often Tokens
Many tokenizers encode leading spaces as part of the token, allowing the model to distinguish a word at the start of a sentence from the same word in the middle.
The model can more naturally differentiate “word‑initial” and “word‑internal” patterns.
6. Why Different Models Yield Different Token Counts
Token count is a product of both the text and the tokenizer. Different models use different vocabularies, byte‑level options, language biases, code optimizations, and Chinese‑specific merging rules, leading to varying token counts for the same sentence.
Token count = text + tokenizer.
Rough heuristics like “1 token ≈ 4 characters” are only very coarse estimates.
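Two hypothetical vocabularies tokenizing the same string show why counts diverge (the vocabularies and counts below are invented for illustration; real tokenizers differ in far subtler ways):

```python
def count_tokens(text: str, vocab: set[str]) -> int:
    """Greedy longest-match token count under a given vocabulary."""
    count, i = 0, 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                count += 1
                i = j
                break
        else:
            count += 1  # unknown character falls back to one token
            i += 1
    return count

# Hypothetical vocabularies: one with large merged fragments,
# one that only knows single characters.
rich = {"token", "ization", " matters"}
poor = set("tokenizam rs")

text = "tokenization matters"
print(count_tokens(text, rich))  # 3
print(count_tokens(text, poor))  # 20
```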
7. Why Token Counts Vary Across Languages and Code
English has stable high‑frequency words and prefixes, so tokenizers can keep many words as single tokens. Chinese lacks spaces and has ambiguous word boundaries, so tokenizers rely more on single characters, bi‑grams, and common phrases. Code contains long identifiers, many symbols, and meaningful whitespace, causing many tokens.
8. What Happens After Tokens Enter the Model
Token IDs are first mapped to high‑dimensional embedding vectors (learned during training). Then positional information is added because the same set of tokens in a different order conveys different meaning.
Embedding = the model’s numeric representation of a token.
With token embeddings and positional encodings, the sequence is fed into the Transformer for attention‑based computation.
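The lookup-and-add step can be sketched with tiny made-up dimensions (real models use thousands of dimensions, trained embedding tables, and often learned or rotary position schemes; sinusoidal encoding is shown here as in the original Transformer):

```python
import math

VOCAB_SIZE, DIM = 16, 4

# Embedding table: one vector per token ID. Placeholder values here;
# in a real model these weights are learned during training.
embedding_table = [[(i * DIM + d) * 0.01 for d in range(DIM)]
                   for i in range(VOCAB_SIZE)]

def positional_encoding(pos: int) -> list[float]:
    """Sinusoidal position vector: alternating sin/cos frequencies."""
    return [math.sin(pos / 10000 ** (d / DIM)) if d % 2 == 0
            else math.cos(pos / 10000 ** ((d - 1) / DIM))
            for d in range(DIM)]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up each ID's vector and add its position encoding."""
    return [[e + p for e, p in zip(embedding_table[tid],
                                   positional_encoding(pos))]
            for pos, tid in enumerate(token_ids)]

vectors = embed([3, 7, 2])
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 dimensions each
```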
9. LLMs Generate Token‑by‑Token, Not Sentence‑by‑Sentence
The training objective is to predict the next most likely token given all previous tokens. Generation proceeds in a loop: predict next token → append it to context → repeat until an end condition.
The model outputs tokens one at a time via autoregressive generation.
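The loop can be sketched with a stub standing in for the network (the predictor below is a made-up toy that echoes a fixed reply; a real model scores the entire vocabulary on every step):

```python
EOS = -1  # hypothetical end-of-sequence token ID

def predict_next(context: list[int]) -> int:
    """Stub for the model: returns a fixed reply, then stops.
    A real LLM runs a full forward pass over the context here."""
    reply = [10, 11, 12]
    generated = len(context) - 2  # this toy assumes a 2-token prompt
    return reply[generated] if generated < len(reply) else EOS

def generate(prompt: list[int], max_tokens: int = 16) -> list[int]:
    context = list(prompt)
    for _ in range(max_tokens):
        token = predict_next(context)  # read whole context, predict one token
        if token == EOS:
            break
        context.append(token)          # append and repeat
    return context[len(prompt):]

print(generate([1, 2]))  # [10, 11, 12]
```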
10. Tokens as the Model’s Step Size
Each generation step consists of reading the entire context, predicting the next token, appending it, and moving to the next step. More output tokens mean higher latency and cost.
Short answers are faster and cheaper.
Long explanations are slower and more expensive.
11. Why Context Windows Are Measured in Tokens
The context window is the maximum number of tokens the model can attend to in a single forward pass. It includes system prompts, conversation history, tool descriptions, tool results, user input, and previously generated output.
The context window is the token capacity the model can “see” at once.
Longer contexts increase attention computation, latency, memory usage, cost, and noise.
12. Direct Relationship Between Tokens and Cost
Commercial LLM APIs charge per token because token count closely reflects actual compute, memory, and network usage. Input tokens increase the amount of context the model must read; output tokens increase the number of generation steps.
Input tokens → read cost.
Output tokens → generation cost.
From an infrastructure perspective, a token is analogous to a CPU time slice, a byte of storage, or a unit of network traffic.
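This pricing model is simple to express directly (the per-million-token prices below are placeholders, not any vendor's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one API call with per-million-token prices.
    Output tokens usually cost more: each one is a full generation step."""
    return (input_tokens * price_in_per_m +
            output_tokens * price_out_per_m) / 1_000_000

# Hypothetical prices: $1 per 1M input tokens, $4 per 1M output tokens.
cost = request_cost(input_tokens=8_000, output_tokens=1_000,
                    price_in_per_m=1.0, price_out_per_m=4.0)
print(f"${cost:.4f}")  # $0.0120
```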
13. Why Same‑Length Texts Can Have Different Token Costs
Content such as stack traces, UUIDs, base64 strings, minified JSON, file paths, SQL, or long variable names compress poorly, inflating token counts.
Tool schemas, JSON schemas, system prompts, and historical tool results also consume tokens.
Repeating full conversation history linearly accumulates tokens, so production systems often truncate, summarize, compress, retrieve‑only‑relevant parts, or cache.
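The truncation strategy can be sketched as dropping the oldest turns until the history fits the budget (per-message token counts are assumed precomputed by the tokenizer; a real system would also pin the system prompt and consider summarization instead of plain dropping):

```python
def truncate_history(messages: list[tuple[str, int]],
                     budget: int) -> list[tuple[str, int]]:
    """Keep the newest messages whose token counts fit within budget.
    Each message is (text, token_count)."""
    kept, used = [], 0
    for text, n_tokens in reversed(messages):  # walk newest-first
        if used + n_tokens > budget:
            break
        kept.append((text, n_tokens))
        used += n_tokens
    return list(reversed(kept))                # restore chronological order

history = [("turn 1", 400), ("turn 2", 300), ("turn 3", 200), ("turn 4", 100)]
print(truncate_history(history, budget=550))  # keeps turn 3 and turn 4
```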
14. Tokens vs. Model “Intelligence”
Tokens are the basic unit of input, output, and computation. Model intelligence depends on parameters, training data, architecture, objectives, and alignment, not on token count alone.
A larger context window only lets the model handle more tokens in a single computation; it does not make the model smarter.
15. Practical Token Management for LLM Systems
15.1 Treat Tokens as a Budget
Each request consumes budgets: input budget, output budget, history budget, and tool budget.
15.2 Prioritize Reducing High‑Noise Content
Overly long system prompts.
Huge tool schemas.
Useless history messages.
Raw logs and large JSON blobs.
15.3 Distinguish “Information to Keep” from “Raw Text to Keep”
Structured summaries.
Key fields.
Conclusions.
Relevant references.
From a token perspective, summarization and compression are major levers for system quality.
15.4 Choose Models and Strategies per Content Type
Long‑form analysis needs careful context budgeting.
Code agents must watch token cost of paths, diffs, and schemas.
Multi‑turn dialogue should prioritize memory compression.
Mature LLM systems always have token‑management awareness at the infrastructure layer.
16. Common Misconceptions
Misconception 1: Token = Word
Incorrect. A token can be a word, sub‑word, punctuation, space prefix, or byte fragment.
Misconception 2: Token = Character
Incorrect. Characters are raw text; tokens are the computational units learned by the tokenizer.
Misconception 3: All Models Count Tokens Identically
Incorrect. Different models use different tokenizers.
Misconception 4: Bigger Context Window Means Smarter Model
Incorrect. It only indicates larger capacity, not reasoning quality.
Misconception 5: Only User Input Consumes Tokens
Incorrect. System prompts, history, tool schemas, tool results, and model outputs all consume tokens.
17. Putting It All Together
The token lifecycle can be visualized as:
flowchart TD
A[Human Text] --> B[Tokenizer]
B --> C[Token Sequence]
C --> D[Token ID]
D --> E[Embedding]
E --> F[Transformer reads full context]
F --> G[Predict next token probability]
G --> H[Select a token]
H --> I[Append back to context]
I --> J[Repeat until stop]

Thus, tokens serve three roles:
Discrete representation of text.
Fundamental step size for model computation and generation.
Core metric for context, cost, and latency.
18. Final Takeaway
A token is not a human‑semantic “word”; it is a neural‑network‑oriented “text atom”.
Understanding tokens reveals the true mechanics of LLMs: they process token sequences, generate output token‑by‑token, and manage context, cost, and latency through token budgeting.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Full-Stack Cultivation Path
Focused on sharing practical tech content about TypeScript, Vue 3, front-end architecture, and source code analysis.
