How Self-Attention Powers LLMs: A Step‑by‑Step Deep Dive
This article explains the self‑attention mechanism behind large language models: why static word importance fails, how queries, keys, and values are generated, and how attention scores are computed, scaled, and passed through softmax to produce context‑aware word vectors. It also notes the computational costs involved.
Why Self‑Attention Is Needed
In language modeling, the next token depends on all previous tokens, not just the immediate predecessor. The relevance of each word changes with context, so a static weight matrix cannot capture these dynamic relationships. Self‑attention provides a mechanism for every token to weigh the importance of every other token in the sequence.
What Self‑Attention Does
Self‑attention updates each token’s vector representation by aggregating information from all other tokens. For each token, three vectors are produced:
Query (Q): what the token wants to focus on.
Key (K): how the token can be attended to by others.
Value (V): the actual content carried forward.
Self‑Attention Computation
The computation consists of four deterministic steps applied to the whole sequence.
Generate Q, K, V – Apply three learned linear projections to the input embedding matrix X ∈ ℝ^{n×d}:
Q = X W_q
K = X W_k
V = X W_v
where W_q, W_k, W_v ∈ ℝ^{d×d_k} are trainable weight matrices.
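To make the projection step concrete, here is a minimal NumPy sketch; the shapes (n = 3 tokens, d = 4, d_k = 4) and the random initialisation are illustrative assumptions, not values from a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, d_k = 3, 4, 4          # sequence length, embedding dim, head dim (illustrative)
X = rng.normal(size=(n, d))  # input embeddings, one row per token

# Trainable projection matrices (randomly initialised here for illustration)
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))

Q = X @ W_q   # queries, shape (n, d_k)
K = X @ W_k   # keys,    shape (n, d_k)
V = X @ W_v   # values,  shape (n, d_k)
```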
Compute raw attention scores – Take the dot product between each query and all keys:
scores = Q Kᵀ // shape (n, n)
Scale and apply softmax – Divide the scores by √d_k to stabilise gradients and convert each row into a probability distribution:
weights = softmax(scores / sqrt(d_k)) // each row sums to 1
Weighted sum of values – Multiply the weight matrix by the value matrix to obtain the final context‑aware representations:
output = weights V // shape (n, d_k)
The resulting output replaces the original token embeddings and is fed to subsequent layers (e.g., feed‑forward networks, additional attention heads).
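Putting the four steps together, a minimal single-head sketch in NumPy might look like the following; the `softmax` helper and all shapes are assumptions for illustration, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # step 1: projections
    d_k = Q.shape[-1]
    scores = Q @ K.T                              # step 2: raw scores, shape (n, n)
    weights = softmax(scores / np.sqrt(d_k))      # step 3: scale + softmax, rows sum to 1
    return weights @ V                            # step 4: weighted sum, shape (n, d_k)

rng = np.random.default_rng(0)
n, d, d_k = 3, 4, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)   # (3, 4)
```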
Illustrative Example
Consider the sentence “the cat sat”. Assume each word is represented by a 3‑dimensional embedding vector (for simplicity). After the linear projections we obtain nine vectors (a Q, K, and V for each word). The attention scores for the token “sat” are computed by taking the dot product of its query with the keys of “the”, “cat”, and “sat”. After scaling and softmax, the weights might look like [0.1, 0.7, 0.2], indicating that “sat” attends most strongly to “cat”. The final vector for “sat” is the weighted sum of the three value vectors, integrating information from the whole sentence.
[Figure: visual summary of the self‑attention pipeline]
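To see the weighted sum numerically, the sketch below hard-codes the hypothetical weights [0.1, 0.7, 0.2] from the example together with made-up 3‑dimensional value vectors; none of these numbers come from a trained model:

```python
import numpy as np

# Hypothetical value vectors for "the", "cat", "sat" (made up for illustration)
V = np.array([
    [0.0, 1.0, 0.0],   # "the"
    [1.0, 0.0, 1.0],   # "cat"
    [0.5, 0.5, 0.0],   # "sat"
])

# Attention weights of "sat" over ("the", "cat", "sat"), as in the example above
weights_sat = np.array([0.1, 0.7, 0.2])

# New representation of "sat": a weighted sum of all three value vectors
out_sat = weights_sat @ V
print(out_sat)   # [0.8 0.2 0.7]
```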
Key Characteristics
Dynamic weighting: each token’s representation is conditioned on the entire sequence.
Parallelizable: all Q, K, V matrices are computed with matrix multiplications, enabling efficient GPU execution.
Quadratic cost: the attention matrix has size n × n, leading to O(n²) time and memory, which becomes a bottleneck for very long sequences (see the rough estimate below).
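As a back-of-the-envelope check of the quadratic cost, the snippet below estimates the memory of a single float32 attention matrix at a few sequence lengths (one head, the activation alone; real implementations vary):

```python
# Rough, illustrative estimate: an n x n float32 attention matrix costs 4 bytes per entry.
for n in (1_024, 8_192, 65_536):
    bytes_needed = n * n * 4
    print(f"n={n:>6}: {bytes_needed / 2**20:,.0f} MiB")
# n=  1024: 4 MiB
# n=  8192: 256 MiB
# n= 65536: 16,384 MiB
```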
Practical Notes
Typical implementations split the embedding dimension into multiple heads (multi‑head attention) to capture diverse relational patterns; a minimal sketch follows this list.
Layer normalization and residual connections are applied around the attention sub‑layer to stabilise training.
For long inputs, variants such as sparse attention, sliding‑window attention, or linear‑complexity approximations are used to mitigate the quadratic cost.
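The sketch below combines the first two notes: it splits the embedding dimension across two heads, concatenates the per-head outputs, and wraps the sub-layer in a residual connection plus layer normalisation. The head count, all shapes, and the omission of the output projection W_o and of learned norm parameters are simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean, unit variance (no learned gain/bias here).
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def multi_head_attention(X, heads):
    # `heads` is a list of (W_q, W_k, W_v) tuples, one per head; a real layer
    # would also apply a final output projection W_o, omitted for brevity.
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)       # shape (n, num_heads * d_k)

rng = np.random.default_rng(0)
n, d, num_heads = 3, 8, 2
d_k = d // num_heads                              # split the embedding dim across heads
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(num_heads)]

X = rng.normal(size=(n, d))
attn_out = multi_head_attention(X, heads)
h = layer_norm(X + attn_out)                      # residual connection + layer norm
print(h.shape)                                    # (3, 8)
```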