Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

Input: How a Sentence Becomes Tokens

When a prompt such as "大模型的KV缓存是什么" ("What is a large model's KV cache?") is sent to a large model, it first passes through a tokenizer that splits the text into tokens. Common English words usually map to a single token, while each Chinese character may become one or two tokens, so this example yields roughly 10-15 tokens.

A token is the smallest unit of language the model can work with; the model never sees raw characters, only token IDs.

Different models use different tokenizers, so the same 2000‑character article may be 3500 tokens in Claude but only 2800 in DeepSeek, making direct per‑million‑token price comparisons unfair.

Common tokenization methods include Byte‑Pair Encoding, WordPiece, and Unigram tokenization.

Example using the transformers library:

from transformers import AutoTokenizer

# Load the tokenizer that ships with the model (Gemma 2 here, as an example)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

# The tokenizer turns raw text into token IDs; the model only ever sees these IDs
print(tokenizer("Sphinx of black quartz, judge my vow.", return_tensors="pt"))
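To make the earlier point about tokenizer differences concrete, here is a minimal sketch that counts tokens for the same text under two unrelated tokenizers (OpenAI's cl100k_base encoding via tiktoken, and the Gemma tokenizer above); the exact counts are not the point, only that they differ:

import tiktoken
from transformers import AutoTokenizer

text = "大模型的KV缓存是什么"  # the example prompt from the beginning of the article

# OpenAI-style BPE encoding via tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print("cl100k_base:", len(enc.encode(text)), "tokens")

# The Gemma tokenizer loaded above, for comparison (count includes special tokens)
hf_tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")
print("gemma-2-2b :", len(hf_tok(text)["input_ids"]), "tokens")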

Inside the Transformer: A Token's Journey

After tokenization, each token ID is looked up in a large embedding table and mapped to a dense vector (e.g., OpenAI's text-embedding-3-large uses 3072-dimensional vectors).
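As a rough sketch of what that lookup means in code (PyTorch here; the vocabulary size and dimension are illustrative, not those of any particular model):

import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 3072             # illustrative sizes only
embedding = nn.Embedding(vocab_size, d_model)  # one learned vector per token ID

token_ids = torch.tensor([[101, 2009, 318]])   # hypothetical IDs from a tokenizer
vectors = embedding(token_ids)                 # table lookup: IDs -> dense vectors
print(vectors.shape)                           # torch.Size([1, 3, 3072])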

Positional information is added via RoPE (Rotary Position Embedding), enabling long-context handling.
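A simplified sketch of the rotation RoPE applies to query and key vectors, assuming the interleaved-pair formulation from the RoFormer paper; production implementations differ in details such as precomputing the sin/cos tables:

import torch

def rotary_embed(x, base=10000.0):
    # x: (seq_len, d) with d even; rotate each pair of dimensions by a position-dependent angle
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,)
    angles = pos * freqs                                                  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)        # 8 positions, 64-dimensional query vectors
q_rot = rotary_embed(q)       # same shape, now position-dependent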

Tokens then enter the self‑attention layer, where each token is projected to Query (Q), Key (K), and Value (V) vectors. The attention score is computed as the dot product of Q with all K vectors, softmax‑normalized, and used to weight the V vectors.
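A minimal single-head version of that computation (the 1/sqrt(d) scaling is standard, though the description above omits it):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices
    Q, K, V = x @ w_q, x @ w_k, x @ w_v
    scores = Q @ K.T / K.shape[-1] ** 0.5    # every Q dotted with every K: a (seq_len, seq_len) matrix
    weights = F.softmax(scores, dim=-1)      # each row sums to 1: the attention weights
    return weights @ V                       # weighted sum of the V vectors

d_model, d_head, seq_len = 512, 64, 10
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape (10, 64)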

The computation scales quadratically with token count (1 × 10⁴ tokens → 1 × 10⁸ dot products), explaining why longer contexts are slower and more expensive.

Multi‑head attention runs several attention heads in parallel, each focusing on different linguistic aspects, and their results are concatenated.

After attention, a feed‑forward network (FFN) processes the aggregated information, with residual connections and layer normalization preserving stability across dozens or hundreds of layers (e.g., GLM‑5’s 744 B‑parameter MoE architecture with 256 experts).
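Put together, one layer looks roughly like the sketch below: a plain dense post-norm block using PyTorch's built-in multi-head attention. Real models vary (pre-norm, RMSNorm, MoE experts in place of the dense FFN, and so on); this is only the skeleton.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Bare-bones layer: multi-head attention + FFN, each wrapped in a residual connection + LayerNorm
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)    # heads run in parallel and are concatenated internally
        x = self.norm1(x + attn_out)        # residual connection around attention
        x = self.norm2(x + self.ffn(x))     # residual connection around the FFN
        return x

block = TransformerBlock()
x = torch.randn(1, 10, 512)                 # (batch, seq_len, d_model)
print(block(x).shape)                       # torch.Size([1, 10, 512])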

All input tokens are processed in parallel, consuming GPU memory proportionally to context length.

Output: Why Generation Costs Several Times More

During generation, tokens are produced autoregressively: each new token attends to all previous tokens, requiring a full forward pass per token. Thus generating 100 tokens needs 100 forward passes, making output tokens far more compute‑intensive than input tokens.
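The loop below sketches that process with Hugging Face transformers, using a small model (gpt2, purely for illustration) and greedy decoding; note how the KV cache (past_key_values) lets each step feed in only the newest token instead of recomputing K and V for the whole prefix:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The KV cache stores", return_tensors="pt").input_ids
past = None
for _ in range(20):                                  # one forward pass per generated token
    inputs = ids if past is None else ids[:, -1:]    # with a cache, only the new token is fed in
    out = model(inputs, past_key_values=past, use_cache=True)
    past = out.past_key_values                       # cached K/V for all earlier tokens
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))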

Pricing examples (per million tokens, converted to CNY):

Claude Opus 4.6 – input 34.4 ¥, output 172 ¥ (≈5×)

Claude Sonnet 4.6, GPT‑5.4 – output ≈103 ¥

Gemini 3.1 Pro – output 82.6 ¥

Domestic models: GLM‑5.1 input 3.5 ¥, output 28 ¥ (8×); Qwen 3.6 Plus input 2.1 ¥, output 12.9 ¥ (6×); Kimi K2.5 input 2.63 ¥, output 11.8 ¥; MiniMax M2.7 input 2.06 ¥, output 8.26 ¥; MiMo‑V2‑Pro input 6.88 ¥, output 20.6 ¥.

DeepSeek V3.2 input 0.96 ¥, output 1.93 ¥ – the cheapest by a large margin.

Thus, generating one million tokens with Claude Opus 4.6 costs about 172 ¥, while DeepSeek V3.2 costs only 1.93 ¥, a ~90× difference.
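A back-of-the-envelope helper for turning the per-million prices above into a per-call cost (the token counts below are made up for illustration):

def call_cost(input_tokens, output_tokens, in_price, out_price):
    # in_price / out_price are per million tokens (CNY), as in the list above
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A hypothetical session with 500k input tokens and 50k output tokens
print(call_cost(500_000, 50_000, 34.4, 172.0))   # Claude Opus 4.6 -> 25.8 ¥
print(call_cost(500_000, 50_000, 0.96, 1.93))    # DeepSeek V3.2   -> ~0.58 ¥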

Saving Money: Prompt Caching

When the same system prompt or context is reused across multiple calls, the KV cache can be stored and reused, avoiding recomputation. Cached tokens are billed at a fraction of the normal input price.

Claude series cache read price ≈10 % of normal input (e.g., Opus 4.6 cache hit 3.44 ¥ vs 34.4 ¥).

DeepSeek V3.2 cache hit 0.096 ¥ (≈10 % of its 0.96 ¥ input price).

Gemini 3.1 Pro cache read ≈25 % (3.44 ¥ from 13.76 ¥).

In agent workflows with dozens or hundreds of LLM calls, prompt caching can cut input costs by 70‑80 %.
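A quick sanity check on that figure, using the Claude Opus 4.6 prices above and an assumed 80 % cache-hit ratio (the ratio is a made-up example; real workloads vary):

def effective_input_price(base_price, cache_price, hit_ratio):
    # Blended input price when a fraction of input tokens is served from the prompt cache
    return hit_ratio * cache_price + (1 - hit_ratio) * base_price

price = effective_input_price(34.4, 3.44, 0.8)
print(f"{price:.2f} ¥ per million input tokens ({1 - price / 34.4:.0%} saving)")  # 9.63 ¥, 72% saving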

Some Overlooked Facts

Models may generate hidden “thinking tokens” during internal reasoning; these are billed but not shown in the final output. Claude Opus 4.6 allows up to 31 999 thinking tokens, which can add noticeable cost.

Context windows are not always fully usable: GPT‑5.4 officially supports 1 M tokens but only opens 272 k by default, and exceeding certain thresholds (e.g., 200 k for Gemini 3.1 Pro) triggers price jumps.

Overall, the article walks through the full lifecycle of a token, from tokenization through embedding, attention, and generation to billing, illustrating why a $17.62 Claude Code session costs what it does.

Figures (from the original post): tokenization illustration; RoPE paper; attention calculation diagram; Transformer architecture; input vs. output pricing chart; attention variants (MHA, GQA, MQA, MLA).
Tags: Transformer, Large Language Model, tokenization, KV cache, prompt caching, LLM pricing
Written by AI Programming Lab

Sharing practical AI programming and Vibe Coding tips.