What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion
This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.
What Attention Is Solving
When processing text, a model operates on a sequence of tokens, the basic units produced by tokenization. A single token cannot be understood in isolation; it must consider surrounding tokens to resolve references such as pronouns or causal relations.
Attention's goal: enable the representation of the current position to draw information from the entire context instead of relying solely on its own vector.
This goal breaks down into three concrete questions:
Which positions in the context should the current token attend to?
How much attention should each position receive?
What information is returned from the attended positions?
Q/K/V and the operation Q @ K^T → softmax → @V are built around these questions.
Q, K, and V Explained
Q (Query): what the current token is looking for.
K (Key): the searchable features of every token.
V (Value): the content a token can provide if it is attended to.
Analogously, attention works like a retrieval system: the query (Q) matches keys (K) and pulls the corresponding values (V).
Why Multiple Heads?
One set of Q/K/V can capture only a single perspective on the context. Different relationships—local collocations, long‑range coreference, semantic vs. structural links—require separate views. Multi‑head attention (MHA) splits the large hidden vector into several smaller head_dim vectors, each processed in parallel.
For a hidden size of 512 and 8 heads, each head receives a 64‑dimensional slice:
```python
hidden_size = 512
num_heads = 8
head_dim = hidden_size // num_heads  # 512 / 8 = 64
```

Variations on head allocation include:
MHA: each Q head has its own K and V heads.
GQA (Grouped-Query Attention): several Q heads share each K/V head, reducing the number of independent K/V heads.
MQA (Multi-Query Attention): all Q heads share a single K/V head.
These designs trade expressive freedom for inference efficiency.
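To make the trade-off concrete, here is a small sketch comparing the three schemes. The specific head counts are illustrative assumptions for the example, not MiniMind's actual defaults:

```python
# Illustrative head configurations for one attention layer.
head_dim = 64
n_q_heads = 8

configs = {
    "MHA": 8,  # one K/V head per Q head
    "GQA": 2,  # every 4 Q heads share one K/V head
    "MQA": 1,  # all Q heads share a single K/V head
}

for name, n_kv_heads in configs.items():
    kv_dim = n_kv_heads * head_dim  # width of the k_proj / v_proj output
    print(f"{name}: n_kv_heads={n_kv_heads}, kv_proj width={kv_dim}, "
          f"KV cache size = {n_kv_heads}/{n_q_heads} of MHA")
```

Fewer K/V heads mean smaller k_proj/v_proj matrices and, more importantly, a proportionally smaller KV cache at inference time.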
Tracking Shapes Through the Forward Pass
The input tensor x arrives as [bsz, seq_len, hidden_size] (batch size, sequence length, hidden dimension). MiniMind first projects x into three tensors:
```python
xq, xk, xv = self.q_proj(x), self.k_proj(x), self.v_proj(x)
```

Each projection is then reshaped into a head-aware layout:
```python
xq = xq.view(bsz, seq_len, self.n_local_heads, self.head_dim)
xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim)
```

After reshaping, a transpose swaps the seq_len and head dimensions so that subsequent matrix multiplications can be performed per batch and per head:
```python
xq = xq.transpose(1, 2)
xk = xk.transpose(1, 2)
xv = xv.transpose(1, 2)
```

Now Q has shape [bsz, n_heads, seq_len, head_dim], while K and V have shape [bsz, n_kv_heads, seq_len, head_dim] until a repeat_kv expansion (covered below) brings their head count up to match Q's.
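To make the layout concrete, here is a minimal, self-contained shape trace with dummy tensors; the sizes are illustrative assumptions consistent with the walkthrough above:

```python
import torch

bsz, seq_len = 2, 10
hidden_size, n_heads, n_kv_heads = 512, 8, 2
head_dim = hidden_size // n_heads  # 64

# Stand-ins for the projection outputs (real code uses nn.Linear layers).
xq = torch.randn(bsz, seq_len, n_heads * head_dim)
xk = torch.randn(bsz, seq_len, n_kv_heads * head_dim)

xq = xq.view(bsz, seq_len, n_heads, head_dim).transpose(1, 2)
xk = xk.view(bsz, seq_len, n_kv_heads, head_dim).transpose(1, 2)

assert xq.shape == (bsz, n_heads, seq_len, head_dim)     # [2, 8, 10, 64]
assert xk.shape == (bsz, n_kv_heads, seq_len, head_dim)  # [2, 2, 10, 64]
```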
Step 1 – Compute Relevance Scores (Q @ Kᵀ)
With the dimensions aligned, PyTorch treats the leading dimensions as batch dimensions and multiplies the last two, yielding a score matrix of shape [seq_len, seq_len] for each head. Each entry is the dot product between a query vector and a key vector, indicating how relevant one position is to another; the scores are scaled by 1/√head_dim so that the softmax in the next step stays well-conditioned.
Step 2 – Convert Scores to Attention Weights (softmax)
The softmax operation normalizes each row of the score matrix into a probability distribution that sums to 1, turning raw relevance scores into attention weights.
Step 3 – Aggregate Values (weights @ V)
The weighted sum of the value vectors produces the new representation for the current token: output_i = Σ_j weight_{i,j} * V_j. This is the step that truly fuses context into the token.
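The three steps fit together in a few lines. A minimal sketch with dummy tensors (shapes follow the walkthrough above; the masks discussed in the next section are omitted here):

```python
import math
import torch
import torch.nn.functional as F

bsz, n_heads, seq_len, head_dim = 2, 8, 10, 64
xq = torch.randn(bsz, n_heads, seq_len, head_dim)
xk = torch.randn(bsz, n_heads, seq_len, head_dim)  # after repeat_kv, K matches Q's head count
xv = torch.randn(bsz, n_heads, seq_len, head_dim)

# Step 1: relevance scores, scaled to keep softmax in a well-behaved range.
scores = (xq @ xk.transpose(-2, -1)) / math.sqrt(head_dim)  # [bsz, n_heads, seq_len, seq_len]

# Step 2: each row becomes a probability distribution over context positions.
weights = F.softmax(scores, dim=-1)

# Step 3: weighted sum of values fuses context into every position.
output = weights @ xv  # [bsz, n_heads, seq_len, head_dim]
```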
Masking Mechanisms
Causal mask: blocks future positions during autoregressive training by setting their scores to a large negative value (effectively -inf) before softmax, which drives their attention weights to zero.
Attention mask: hides padded or otherwise invalid positions so they contribute nothing to attention.
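One common way to build and apply a causal mask, shown here for a single head and batch element for brevity (a sketch of the pattern, not MiniMind's verbatim code):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)

# The upper triangle (j > i) marks future positions that token i must not see.
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # masked positions get exactly zero weight
```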
KV Cache for Efficient Generation
During inference, recomputing K and V for all previous tokens at every step is wasteful. The KV cache stores the already computed K and V tensors for the history, so each decoding step only needs to project the newest token.
Q is not cached because it is specific to the current timestep; only K/V are reusable across future steps.
Appending New Tokens to the Cache
When a new token is generated, its K and V are concatenated to the cached tensors along the sequence dimension ( dim=1), extending the searchable context.
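A sketch of the append step, assuming the cache holds tensors in the pre-transpose layout [bsz, cached_len, n_kv_heads, head_dim] so that dim=1 is the sequence axis, as the text above implies:

```python
import torch

bsz, n_kv_heads, head_dim = 1, 2, 64
past_k = torch.randn(bsz, 7, n_kv_heads, head_dim)  # K for 7 cached tokens
new_k = torch.randn(bsz, 1, n_kv_heads, head_dim)   # K for the one new token

# Concatenate along the sequence dimension; V is extended the same way.
k = torch.cat([past_k, new_k], dim=1)               # [bsz, 8, n_kv_heads, head_dim]
```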
repeat_kv in GQA
GQA reduces the number of independent K/V heads. The repeat_kv operation simply expands the shared K/V tensors so that each Q head can still attend to a compatible K/V shape without creating new independent parameters.
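A common implementation of this helper, mirroring the widely used Llama-style pattern (a sketch rather than MiniMind's verbatim code):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand [bsz, seq_len, n_kv_heads, head_dim] so each K/V head serves n_rep Q heads."""
    bsz, seq_len, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    # expand() creates a broadcast view with no new parameters;
    # reshape() then flattens the repeats into the head dimension.
    return (
        x[:, :, :, None, :]
        .expand(bsz, seq_len, n_kv_heads, n_rep, head_dim)
        .reshape(bsz, seq_len, n_kv_heads * n_rep, head_dim)
    )

k = torch.randn(1, 10, 2, 64)
print(repeat_kv(k, 4).shape)  # torch.Size([1, 10, 8, 64])
```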
Re‑assembling Multi‑Head Results
After the attention computation, the tensor shape is [bsz, heads, seq_len, head_dim]. It is first transposed back to [bsz, seq_len, heads, head_dim] and then reshaped to [bsz, seq_len, hidden_size]:
```python
output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
output = self.o_proj(output)
```

The final o_proj layer maps the concatenated multi-head output back to the block's hidden representation, ready for the residual connection, MLP, and the next transformer layer.
Putting It All Together
Project input into Q, K, V.
Reshape and transpose to obtain per‑head tensors.
Compute Q @ K^T → softmax → @V to fuse context.
Apply causal and attention masks to enforce autoregressive and padding constraints.
Cache K/V during generation; use repeat_kv for GQA.
Merge heads back, project to hidden_size, and pass to the next block.
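For reference, the whole flow fits in a compact module. This is a simplified stand-in rather than MiniMind's actual class: RoPE and the KV cache are omitted, repeat_interleave plays the role of repeat_kv, and PyTorch's built-in scaled_dot_product_attention handles scaling, causal masking, softmax, and the weighted sum internally:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGQAAttention(nn.Module):
    def __init__(self, hidden_size=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = hidden_size // n_heads
        self.q_proj = nn.Linear(hidden_size, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        bsz, seq_len, _ = x.shape
        # Project, split into heads, and move heads before the sequence axis.
        q = self.q_proj(x).view(bsz, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(bsz, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Expand shared K/V heads to match the number of Q heads (GQA).
        n_rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(n_rep, dim=1)
        v = v.repeat_interleave(n_rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads and project back to the hidden size.
        return self.o_proj(out.transpose(1, 2).reshape(bsz, seq_len, -1))

x = torch.randn(2, 10, 512)
print(SimpleGQAAttention()(x).shape)  # torch.Size([2, 10, 512])
```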
Next up: a deep dive into the MLP that follows attention.