How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of large language models: turning user queries into token matrices via tokenization and embedding, processing them with the Transformer's self‑attention and multi‑head mechanisms, and decoding logits back into human‑readable text. It also covers position encoding, long‑context strategies, generation parameters, and practical engineering tips.


In the era of AI, understanding how large language models (LLMs) process a user’s question and generate a response is essential for effective use.

1. Input: From User Question to a Matrix the Model Can "Read"

The model receives a combined text called the context, which includes the system prompt, tool descriptions, conversation history, and the latest user query.

messages = [
    {"role": "system", "content": "You are a smart assistant; keep your answers cute."},  # system prompt
    {"role": "user", "content": "Hello"},  # previous question
    {"role": "assistant", "content": "Hi, how can I help you?"},  # previous answer
    {"role": "user", "content": "Check today's weather"}  # latest question
]

tools = [{"type": "function", "function": {"name": "get_weather", "description": "Get current weather information"}}]
Note: Each call to the model is independent; the growing context across a conversation is what enables continuity.
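As a rough sketch of how these pieces could be flattened into a single context string, the function below reuses the messages and tools defined above. The role markers and layout are a simplified assumption for illustration; real chat templates vary by model, and this is not DeepSeek‑V3's actual format.

import json

def build_context(messages, tools):
    # Flatten the system prompt, tool descriptions, and dialogue history
    # into one block of text that the model reads as its context.
    parts = []
    if tools:
        parts.append("Available tools: " + json.dumps(tools))
    for m in messages:
        parts.append(f"<|{m['role']}|> {m['content']}")
    parts.append("<|assistant|>")  # cue the model to generate the next reply
    return "\n".join(parts)

print(build_context(messages, tools))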

2. Tokenization & Embedding: Turning Text into Numbers

Tokenization splits text into smaller units (tokens). For Chinese, "北京" may become a single token; for English, "unhappy" may be split into "un" and "happy".

Each token is mapped to a numeric ID using a vocabulary of tens of thousands of entries.

Embedding uses a learned matrix to convert each ID into a fixed‑dimensional vector (e.g., a 512‑dimensional vector).

The result is an n × 512 matrix where n is the number of tokens.

Note: The number of tokens n is what counts toward the context length that the model processes.
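A toy sketch of the ID lookup and embedding step; the vocabulary and embedding matrix below are random stand‑ins, not a real tokenizer or trained model.

import numpy as np

# Toy vocabulary mapping token strings to IDs (real vocabularies hold tens of thousands of entries).
vocab = {"<unk>": 0, "un": 1, "happy": 2, "北京": 3, "weather": 4}

def tokenize(pieces):
    # Map each token to its ID, falling back to <unk> for unknown tokens.
    return [vocab.get(t, vocab["<unk>"]) for t in pieces]

d_model = 512
embedding_matrix = np.random.randn(len(vocab), d_model)  # learned during training in a real model

ids = tokenize(["un", "happy", "北京"])  # n = 3 tokens
x = embedding_matrix[ids]                # shape (n, 512): the input matrix
print(x.shape)                           # (3, 512)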

3. Context Length Limits

LLMs have strict context windows (e.g., DeepSeek‑V3 supports up to 128 k tokens, but the usable input is 124 k after reserving space for output). Exceeding the limit triggers errors, so implementations often drop the oldest tokens.
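One simple truncation policy, sketched below, keeps the system prompt and drops the oldest dialogue turns until the input budget fits. The per‑message token counts are assumed to be supplied by the caller (a hypothetical input, since counting depends on the tokenizer).

def trim_history(messages, token_counts, max_input_tokens=124_000):
    # Drop the oldest non-system messages until the context fits the input budget.
    # token_counts[i] is the token length of messages[i].
    kept, counts = list(messages), list(token_counts)
    while sum(counts) > max_input_tokens and len(kept) > 2:
        kept.pop(1)   # index 0 is the system prompt; index 1 is the oldest turn
        counts.pop(1)
    return kept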

Figure: Transformer layer diagram

4. Transformer Architecture & Self‑Attention

The core of the model is the Transformer, whose essential component is the self‑attention mechanism.

4.1 Self‑Attention: How the Model "Focuses" on Important Information

Query (Q) matrix: Represents what information the token seeks.

Key (K) matrix: Represents what each token contains.

Value (V) matrix: Holds the actual content to be passed on.

Attention works in two steps:

Compute attention scores as the dot product Q·Kᵀ, scale by √d_k, and normalize with softmax.

Apply the normalized scores as weights to V and sum to obtain each token's contextual representation.

Note: The final token’s attention aggregates information from the entire context.
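A minimal numpy sketch of single‑head scaled dot‑product attention, including the causal mask that decoder‑only LLMs use so each token only attends to earlier positions.

import numpy as np

def self_attention(x, W_q, W_k, W_v):
    # x: (n, d_model) token embeddings; W_q/W_k/W_v: (d_model, d_k) learned projections.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n): relevance of every token to every other
    mask = np.triu(np.ones_like(scores), k=1)  # causal mask: hide future tokens
    scores = np.where(mask == 1, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # (n, d_k) contextual representations

n, d_model, d_k = 4, 512, 64
x = np.random.randn(n, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)  # (4, 64)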

4.2 Multi‑Head Attention: Multiple Perspectives

Several self‑attention heads run in parallel, each with its own Q, K, V matrices, and their outputs are concatenated and linearly transformed, allowing the model to capture diverse relational patterns.
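A sketch of how several heads run in parallel, reusing the self_attention function (and the x, d_model, d_k variables) from the previous snippet: each head has its own projections, and the concatenated outputs are mixed back to d_model by an output matrix W_o.

def multi_head_attention(x, heads, W_o):
    # heads: list of (W_q, W_k, W_v) tuples, one per head; W_o: (num_heads * d_k, d_model).
    outputs = [self_attention(x, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    concat = np.concatenate(outputs, axis=-1)  # (n, num_heads * d_k)
    return concat @ W_o                        # back to (n, d_model)

num_heads = 8
heads = [tuple(np.random.randn(d_model, d_k) for _ in range(3)) for _ in range(num_heads)]
W_o = np.random.randn(num_heads * d_k, d_model)
print(multi_head_attention(x, heads, W_o).shape)  # (4, 512)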

4.3 Feed‑Forward Network (FFN)

After attention, each token passes through an FFN that further processes the aggregated information, analogous to a human reflecting on a discussion.
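A sketch of the position‑wise FFN (continuing with numpy from the snippets above): each token's vector is expanded, passed through a nonlinearity, and projected back down. GELU is used here as the activation; the exact choice varies by model.

def feed_forward(h, W1, b1, W2, b2):
    # h: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model). Applied to each token independently.
    z = h @ W1 + b1
    z = 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))  # GELU approximation
    return z @ W2 + b2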

Figure: Transformer layer with attention and FFN

5. From Hidden States to Human‑Readable Text

5.1 Linear Projection to Vocabulary Space

The hidden state vectors are linearly projected onto the vocabulary dimension, producing raw scores (logits) for each possible token.

[2.1, -0.3, 1.8, ..., 0.02]

5.2 Softmax: Converting Logits to Probabilities

Softmax normalizes logits into a probability distribution, e.g., [0.15, 0.02, 0.25, ..., 0.001], from which a token is sampled.
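A tiny sketch of the softmax step applied to the example logits above, followed by sampling a token ID from the resulting distribution.

import numpy as np

def softmax(logits):
    # Convert raw logits into a probability distribution over the vocabulary.
    z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.1, -0.3, 1.8, 0.02])
probs = softmax(logits)
next_token_id = np.random.choice(len(probs), p=probs)  # sample one token ID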

5.3 Autoregressive Generation

Predict the first token’s probability distribution.

Sample a token (often the highest‑probability one).

Append the token to the context and repeat.

Stop when an end‑of‑sequence token appears or a length limit is reached.

Note: Each new token’s prediction includes the entire previous context, which is why output length counts toward the context window.
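The loop below sketches autoregressive decoding. It assumes a hypothetical model(ids) function that returns next‑token logits (not a real API) and reuses numpy and the softmax helper from the previous snippet.

def generate(model, input_ids, eos_id, max_new_tokens=256):
    # Autoregressive decoding: each new token is appended to the context and fed back in.
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                        # hypothetical: next-token logits for the current context
        next_id = int(np.argmax(softmax(logits)))  # greedy pick; sampling is also common
        ids.append(next_id)
        if next_id == eos_id:                      # stop at the end-of-sequence token
            break
    return ids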

6. Position Encoding & Long‑Text Extrapolation

6.1 Position Encoding

Since attention alone loses order information, position encodings are added to token embeddings. Two main types:

Absolute encoding: Unique identifiers for each position; performance drops when exceeding training lengths.

Relative encoding (e.g., RoPE): Encodes the distance between tokens, allowing better generalization to longer inputs.

Note: Relative encodings decay attention scores for distant tokens, naturally focusing on nearby context.
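A minimal numpy sketch of RoPE's core rotation (the interleaved‑pair variant): each pair of dimensions is rotated by an angle proportional to the token's position, so the dot product between two rotated vectors depends on their relative distance. Real implementations apply this to Q and K inside attention.

import numpy as np

def rope(x, positions, base=10000):
    # x: (n, d) with d even; positions: (n,) integer token positions.
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    angles = np.outer(positions, inv_freq)        # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

q = np.random.randn(6, 64)
q_rotated = rope(q, positions=np.arange(6))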

6.2 Long‑Text Strategies

Even relative encodings have limits. Common approaches include:

Interpolation methods (e.g., YaRN) that map longer distances into the model’s familiar range.

Sliding‑window or selective attention that restricts computation to a fixed window, reducing cost at the expense of some information.
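A sketch of a sliding‑window attention mask: each token is allowed to attend only to itself and the previous window_size − 1 tokens, which caps the per‑token attention cost.

import numpy as np

def sliding_window_mask(n, window_size):
    # Boolean mask of shape (n, n): True where attention is allowed.
    # Token i may attend to tokens j with i - window_size < j <= i.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window_size)

print(sliding_window_mask(6, 3).astype(int))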

6.3 Training for Long Contexts

Typical workflow: pre‑train on short sequences (2‑8 k tokens) then fine‑tune on a smaller set of longer texts (e.g., 32 k → 128 k). This leverages learned language abilities while extending context handling.

7. Practical Engineering Guidance

7.1 Multimodal Input

Models like DeepSeek‑V3 primarily accept text; image inputs are usually processed by a separate vision encoder, whose textual description is then fed to the LLM.

Figure: Multimodal example

7.2 Reducing Context for Faster Inference

Trim unnecessary system prompts and tool descriptions.

Limit the amount of historical dialogue retained; retrieve only relevant past turns.

Use multiple specialized agents (sub‑Agents) to split a large task into smaller contexts, reducing the quadratic cost of attention.

For example, a 12 k token request can be broken into four 3 k token sub‑requests, reducing total compute from 12²=144 to 4×3²=36 (relative units).

7.3 Generation Parameters

Adjust temperature to control randomness (lower values make output more deterministic) and top‑p to limit sampling to the most probable tokens.
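The sketch below shows how temperature and top‑p reshape the sampling step, reusing numpy and the softmax helper from Section 5.

def sample_token(logits, temperature=0.7, top_p=0.9):
    # Lower temperature sharpens the distribution; top-p keeps only the smallest
    # set of tokens whose cumulative probability reaches top_p, then renormalizes.
    probs = softmax(np.asarray(logits) / temperature)
    order = np.argsort(probs)[::-1]                  # most probable tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens kept in the nucleus
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(np.random.choice(kept, p=kept_probs))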

7.4 Latency Considerations

The first token's latency grows roughly with the square of the context length; subsequent tokens reuse the cached context and are cheaper, but total generation time still grows with output length. Reducing context size and output length therefore directly lowers response time.

8. Conclusion

This article has provided a systematic overview of how LLMs transform text input into matrices, process them through Transformer layers, and decode the results back into natural language, along with concrete engineering techniques for handling long contexts, multimodal inputs, and inference efficiency.

