What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion
This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.
What Attention Is Solving
When processing text, a model operates on a sequence of tokens, the basic units produced by tokenization. A single token cannot be understood in isolation; it must consider surrounding tokens to resolve references such as pronouns or causal relations.
Attention's goal: enable the representation of the current position to draw information from the entire context instead of relying solely on its own vector.
This goal breaks down into three concrete questions:
Which positions in the context should the current token attend to?
How much attention should each position receive?
What information is returned from the attended positions?
Q/K/V and the operation Q @ K^T → softmax → @V are built around these questions.
Q, K, and V Explained
Q (Query): what the current token is looking for.
K (Key): the searchable features of every token.
V (Value): the content a token can provide if it is attended to.
Analogously, attention works like a retrieval system: the query (Q) matches keys (K) and pulls the corresponding values (V).
Why Multiple Heads?
One set of Q/K/V can capture only a single perspective on the context. Different relationships—local collocations, long‑range coreference, semantic vs. structural links—require separate views. Multi‑head attention (MHA) splits the large hidden vector into several smaller head_dim vectors, each processed in parallel.
For a hidden size of 512 and 8 heads, each head receives a 64‑dimensional slice:
```python
hidden_size = 512
num_heads = 8
head_dim = hidden_size // num_heads  # 512 / 8 = 64
```

Variations on head allocation include:
MHA: each Q head has its own K and V heads.
GQA (Grouped-Query Attention): several Q heads share each K/V head, reducing the number of independent K/V heads.
MQA (Multi-Query Attention): all Q heads share a single K/V head.
These designs trade expressive freedom for inference efficiency.
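To make the trade-off concrete, here is a small sketch comparing the three schemes. The specific head counts are illustrative assumptions for the example, not MiniMind's actual defaults:

```python
# Illustrative head configurations for one attention layer.
head_dim = 64
n_q_heads = 8

configs = {
    "MHA": 8,  # one K/V head per Q head
    "GQA": 2,  # every 4 Q heads share one K/V head
    "MQA": 1,  # all Q heads share a single K/V head
}

for name, n_kv_heads in configs.items():
    kv_dim = n_kv_heads * head_dim  # width of the k_proj / v_proj output
    print(f"{name}: n_kv_heads={n_kv_heads}, kv_proj width={kv_dim}, "
          f"KV cache size = {n_kv_heads}/{n_q_heads} of MHA")
```

Fewer K/V heads mean smaller k_proj/v_proj matrices and, more importantly, a proportionally smaller KV cache at inference time.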
Tracking Shapes Through the Forward Pass
The input tensor x arrives as [bsz, seq_len, hidden_size] (batch size, sequence length, hidden dimension). MiniMind first projects x into three tensors:
```python
xq, xk, xv = self.q_proj(x), self.k_proj(x), self.v_proj(x)
```

Each projection is then reshaped into a head-aware layout:
```python
xq = xq.view(bsz, seq_len, self.n_local_heads, self.head_dim)
xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
```

After reshaping, a transpose swaps the seq_len and head dimensions so that subsequent matrix multiplications can be performed per batch and per head:
```python
xq = xq.transpose(1, 2)
xk = xk.transpose(1, 2)
xv = xv.transpose(1, 2)
```

Now Q has shape [bsz, n_heads, seq_len, head_dim], while K and V have shape [bsz, n_kv_heads, seq_len, head_dim] until a repeat_kv expansion (covered below) brings their head count up to match Q's.
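To make the layout concrete, here is a minimal, self-contained shape trace with dummy tensors; the sizes are illustrative assumptions consistent with the walkthrough above:

```python
import torch

bsz, seq_len = 2, 10
hidden_size, n_heads, n_kv_heads = 512, 8, 2
head_dim = hidden_size // n_heads  # 64

# Stand-ins for the projection outputs (real code uses nn.Linear layers).
xq = torch.randn(bsz, seq_len, n_heads * head_dim)
xk = torch.randn(bsz, seq_len, n_kv_heads * head_dim)

xq = xq.view(bsz, seq_len, n_heads, head_dim).transpose(1, 2)
xk = xk.view(bsz, seq_len, n_kv_heads, head_dim).transpose(1, 2)

assert xq.shape == (bsz, n_heads, seq_len, head_dim)     # [2, 8, 10, 64]
assert xk.shape == (bsz, n_kv_heads, seq_len, head_dim)  # [2, 2, 10, 64]
```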
Step 1 – Compute Relevance Scores (Q @ Kᵀ)
With the dimensions aligned, PyTorch treats the leading dimensions as batch dimensions and multiplies the last two, yielding a score matrix of shape [seq_len, seq_len] for each head. Each entry is the dot product between a query vector and a key vector, indicating how relevant one position is to another; the scores are scaled by 1/√head_dim so that the softmax in the next step stays well-conditioned.
Step 2 – Convert Scores to Attention Weights (softmax)
The softmax operation normalizes each row of the score matrix into a probability distribution that sums to 1, turning raw relevance scores into attention weights.
Step 3 – Aggregate Values (weights @ V)
The weighted sum of the value vectors produces the new representation for the current token: output_i = Σ_j weight_{i,j} * V_j. This is the step that truly fuses context into the token.
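The three steps fit together in a few lines. A minimal sketch with dummy tensors (shapes follow the walkthrough above; the masks discussed in the next section are omitted here):

```python
import math
import torch
import torch.nn.functional as F

bsz, n_heads, seq_len, head_dim = 2, 8, 10, 64
xq = torch.randn(bsz, n_heads, seq_len, head_dim)
xk = torch.randn(bsz, n_heads, seq_len, head_dim)  # after repeat_kv, K matches Q's head count
xv = torch.randn(bsz, n_heads, seq_len, head_dim)

# Step 1: relevance scores, scaled to keep softmax in a well-behaved range.
scores = (xq @ xk.transpose(-2, -1)) / math.sqrt(head_dim)  # [bsz, n_heads, seq_len, seq_len]

# Step 2: each row becomes a probability distribution over context positions.
weights = F.softmax(scores, dim=-1)

# Step 3: weighted sum of values fuses context into every position.
output = weights @ xv  # [bsz, n_heads, seq_len, head_dim]
```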
Masking Mechanisms
Causal mask: blocks future positions during autoregressive training by setting their scores to a large negative value (effectively -inf) before softmax, which drives their attention weights to zero.
Attention mask: hides padded or otherwise invalid positions so they contribute nothing to attention.
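One common way to build and apply a causal mask, shown here for a single head and batch element for brevity (a sketch of the pattern, not MiniMind's verbatim code):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)

# The upper triangle (j > i) marks future positions that token i must not see.
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # masked positions get exactly zero weight
```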
KV Cache for Efficient Generation
During inference, recomputing K and V for all previous tokens at every step is wasteful. The KV cache stores the already computed K and V tensors for the history, so each decoding step only needs to project the newest token.
Q is not cached because it is specific to the current timestep; only K/V are reusable across future steps.
Appending New Tokens to the Cache
When a new token is generated, its K and V are concatenated to the cached tensors along the sequence dimension ( dim=1), extending the searchable context.
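A sketch of the append step, assuming the cache holds tensors in the pre-transpose layout [bsz, cached_len, n_kv_heads, head_dim] so that dim=1 is the sequence axis, as the text above implies:

```python
import torch

bsz, n_kv_heads, head_dim = 1, 2, 64
past_k = torch.randn(bsz, 7, n_kv_heads, head_dim)  # K for 7 cached tokens
new_k = torch.randn(bsz, 1, n_kv_heads, head_dim)   # K for the one new token

# Concatenate along the sequence dimension; V is extended the same way.
k = torch.cat([past_k, new_k], dim=1)               # [bsz, 8, n_kv_heads, head_dim]
```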
repeat_kv in GQA
GQA reduces the number of independent K/V heads. The repeat_kv operation simply expands the shared K/V tensors so that each Q head can still attend to a compatible K/V shape without creating new independent parameters.
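A common implementation of this helper, mirroring the widely used Llama-style pattern (a sketch rather than MiniMind's verbatim code):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand [bsz, seq_len, n_kv_heads, head_dim] so each K/V head serves n_rep Q heads."""
    bsz, seq_len, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    # expand() creates a broadcast view with no new parameters;
    # reshape() then flattens the repeats into the head dimension.
    return (
        x[:, :, :, None, :]
        .expand(bsz, seq_len, n_kv_heads, n_rep, head_dim)
        .reshape(bsz, seq_len, n_kv_heads * n_rep, head_dim)
    )

k = torch.randn(1, 10, 2, 64)
print(repeat_kv(k, 4).shape)  # torch.Size([1, 10, 8, 64])
```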
Re‑assembling Multi‑Head Results
After the attention computation, the tensor shape is [bsz, heads, seq_len, head_dim]. It is first transposed back to [bsz, seq_len, heads, head_dim] and then reshaped to [bsz, seq_len, hidden_size]:
```python
output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
output = self.o_proj(output)
```

The final o_proj layer maps the concatenated multi-head output back to the block's hidden representation, ready for the residual connection, MLP, and the next transformer layer.
Putting It All Together
Project input into Q, K, V.
Reshape and transpose to obtain per‑head tensors.
Compute Q @ K^T → softmax → @V to fuse context.
Apply causal and attention masks to enforce autoregressive and padding constraints.
Cache K/V during generation; use repeat_kv for GQA.
Merge heads back, project to hidden_size, and pass to the next block.
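For reference, the whole flow fits in a compact module. This is a simplified stand-in rather than MiniMind's actual class: RoPE and the KV cache are omitted, repeat_interleave plays the role of repeat_kv, and PyTorch's built-in scaled_dot_product_attention handles scaling, causal masking, softmax, and the weighted sum internally:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGQAAttention(nn.Module):
    def __init__(self, hidden_size=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = hidden_size // n_heads
        self.q_proj = nn.Linear(hidden_size, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        bsz, seq_len, _ = x.shape
        # Project, split into heads, and move heads before the sequence axis.
        q = self.q_proj(x).view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Expand shared K/V heads to match the number of Q heads (GQA).
        n_rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(n_rep, dim=1)
        v = v.repeat_interleave(n_rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads and project back to the hidden size.
        return self.o_proj(out.transpose(1, 2).reshape(bsz, seq_len, -1))

x = torch.randn(2, 10, 512)
print(SimpleGQAAttention()(x).shape)  # torch.Size([2, 10, 512])
```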
Next up: a deep dive into the MLP that follows attention.