
Deep Dive into Transformer Mechanics: Scaling, Q/K Projections, FFNs, and More

This article provides concise technical explanations for 25 common questions about Transformer models, covering scaled dot‑product attention scaling, separate Q/K projections, feed‑forward network design, attention variants, normalization, LoRA versus full‑parameter training, KV‑cache, pre‑ and post‑norm, computational cost analysis, and advanced position‑encoding techniques.

Baobao Algorithm Notes

May 5, 2024

1. Why does scaled dot‑product attention divide the QK inner product by \(\sqrt{d}\)?

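If the components of q and k are independent with zero mean and unit variance, their inner product has variance \(d\), so the logits grow in magnitude as the head dimension increases. Dividing by \(\sqrt{d}\) restores unit variance and keeps the logits out of the softmax saturation region, where gradients become vanishingly small. A comparable effect can be obtained by folding the factor into the initialization of the Q/K projection weights so that the initial logits have a similar magnitude, although this controls the scale only at the start of training.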
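The variance argument can be checked numerically. The following is a minimal sketch, assuming q and k are drawn as independent standard normal vectors; the function name and sample count are illustrative choices, not part of the original article.

```python
import math
import random
import statistics


def dot_product_variance(d, n_samples=2000, seed=0):
    """Estimate the variance of q.k with and without the 1/sqrt(d) factor."""
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(n_samples):
        # Assumption: every component is an independent N(0, 1) draw.
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dot = sum(qi * ki for qi, ki in zip(q, k))
        raw.append(dot)
        scaled.append(dot / math.sqrt(d))
    return statistics.pvariance(raw), statistics.pvariance(scaled)


raw_var, scaled_var = dot_product_variance(64)
print(f"unscaled variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

With d = 64 the unscaled variance should land near 64 and the scaled variance near 1, up to sampling noise.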

2. Why are different linear projection matrices used for Q and K in self‑attention?

If Q and K share one projection matrix \(W\), the score matrix \(XW(XW)^\top\) is symmetric, so before softmax normalization token i scores token j exactly as token j scores token i, which reduces model expressiveness. Each diagonal entry is a squared norm and is typically larger than the off-diagonal entries, so every token attends excessively to itself. Separate projection matrices add parameters and allow the model to learn asymmetric similarity functions.
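The symmetry is easy to observe on a toy example. This is a small sketch in plain Python, assuming random Gaussian inputs and weights; the helper names are made up for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_logits(x, w_q, w_k):
    """Unnormalized scores Q K^T for a token matrix x (one row per token)."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    k_transposed = [list(col) for col in zip(*k)]
    return matmul(q, k_transposed)


def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = random_matrix(4, 8, rng)          # 4 tokens, hidden size 8
w_shared = random_matrix(8, 8, rng)
w_other = random_matrix(8, 8, rng)

shared = attention_logits(x, w_shared, w_shared)    # Q and K use the same matrix
separate = attention_logits(x, w_shared, w_other)   # Q and K use different matrices
print(is_symmetric(shared), is_symmetric(separate))
```

With a shared matrix the diagonal entries are squared norms and therefore non-negative, which is the source of the self-attention bias described above.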

3. Why does the first FFN layer expand the hidden dimension while the second reduces it back?

The expansion (often by a factor of 4) acts like a kernel transformation: it lifts the representation to a higher‑dimensional space where non‑linear relationships become linearly separable, and it also provides many more trainable parameters, increasing model capacity. The subsequent reduction restores the original dimensionality so that residual connections can be added without extra reshaping.

4. How do Multi‑Query Attention (MQA) and Grouped‑Query Attention (GQA) compare to Multi‑Head Attention (MHA) in computation?

MQA shares a single key/value (KV) head across all query heads, and GQA shares one KV head per group of query heads, so both have fewer K/V projection parameters than MHA. The attention FLOP count is essentially unchanged, because every query head still computes its own QK inner products, softmax, and weighted sum; the compute savings are confined to the K/V projection step. The main benefit appears in decoder-only models that cache KV tensors: fewer KV heads shrink the cache and the memory traffic per decoding step, enabling longer context windows.

5. Why are most modern LLMs decoder‑only, and why does unidirectional attention often outperform bidirectional attention?

One commonly cited argument is that bidirectional attention matrices tend to become low-rank in deep models, which limits capacity. A causal (lower-triangular) attention matrix has strictly positive diagonal entries after softmax, so it is always full-rank and preserves modeling power. In addition, next-token prediction is a harder objective that pushes the model toward richer representations, the causal mask provides an implicit positional bias, and unidirectional attention enables efficient KV-caching during inference.

6. Why can token embeddings and absolute position encodings be added directly in BERT?

Adding the two embeddings is mathematically equivalent to concatenating the one-hot token vector with the one-hot position vector and applying a single linear projection, so addition is as expressive as that design while being cheaper than concatenating dense embeddings. In high-dimensional spaces (e.g., 768-D) random vectors are also nearly orthogonal, so the sum retains most of the information in each term.
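The equivalence between addition and "concatenate one-hot vectors, then project" can be verified directly. A toy sketch follows; the table values and function names are arbitrary and chosen only for illustration.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_matmul(vec, matrix):
    """Row vector times matrix (given as a list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]


# Toy tables: 3 tokens, 2 positions, embedding size 4 (values are arbitrary).
token_table = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.7, -0.2, 0.0, 0.9]]
position_table = [[0.01, 0.02, 0.03, 0.04], [-0.5, 0.5, -0.5, 0.5]]


def embed_by_addition(token_id, position):
    return [t + p for t, p in zip(token_table[token_id], position_table[position])]


def embed_by_concat_projection(token_id, position):
    # Concatenate the two one-hot vectors and apply ONE stacked projection matrix.
    concat = one_hot(token_id, len(token_table)) + one_hot(position, len(position_table))
    stacked = token_table + position_table   # shape (3 + 2) x 4
    return vec_matmul(concat, stacked)


print(embed_by_addition(1, 1))
print(embed_by_concat_projection(1, 1))
```

Both calls print the same vector, because the stacked projection simply selects one row from each table and sums them.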

7. How do LoRA and full‑parameter training differ in compute and memory, and why does LoRA speed up training?

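LoRA freezes the backbone and inserts trainable low-rank adapters into selected linear layers. The forward pass costs slightly more than in full-parameter training because of the extra adapter matmuls. The backward pass still propagates activation gradients through the frozen weights but skips their weight gradients, so total compute stays of the same order as full-parameter training rather than dropping dramatically. The large savings are in memory: gradients and optimizer states are stored only for the adapter parameters, not for the entire backbone. This reduction allows larger batch sizes and lowers inter-GPU communication, which is where most of the observed speedup comes from. Because the backbone is frozen, it can also be quantized (e.g., int8/int4), further saving memory.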
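To get a feel for the memory difference, the sketch below counts trainable parameters and Adam state for a single linear layer. It assumes the usual LoRA factorization into a rank × d_in matrix and a d_out × rank matrix, and fp32 optimizer states; the layer size and rank in the example are hypothetical.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters of a rank-r adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def adam_state_bytes(n_trainable, bytes_per_value=4):
    """Two moment estimates per trainable parameter (fp32 assumed)."""
    return 2 * n_trainable * bytes_per_value


# Hypothetical 4096 x 4096 projection with a rank-8 adapter.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(full, lora, full // lora)
print(adam_state_bytes(full), adam_state_bytes(lora))
```

For this layer the adapter trains 256 times fewer parameters, and the Adam state shrinks by the same factor.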

8. Why is normalization (batchnorm or layernorm) needed?

Features often have different scales; normalization rescales them to a comparable range, preventing any single feature from dominating.

BatchNorm mitigates internal covariate shift by stabilizing the distribution of layer inputs, smoothing the loss landscape.

LayerNorm normalizes across the feature dimension of each token, making it robust to variable sequence lengths and padding.
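As a concrete illustration of per-token normalization, here is a minimal LayerNorm sketch without the learnable gain and bias. The function name and the epsilon value are assumptions made for this example.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


# Each token is normalized on its own, so padding or other sequences
# in the batch cannot change the result for this token.
real_token = [1.0, 2.0, 3.0, 4.0]
padding_token = [0.0, 0.0, 0.0, 0.0]
normalized = [layer_norm(t) for t in (real_token, padding_token)]
print(normalized[0])
```

The statistics are computed over the feature dimension only, which is why sequence length and padding do not enter the calculation.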

9. What are the pros and cons of pre‑norm versus post‑norm in Transformers?

Post‑norm (original design) normalizes after the residual addition. It provides stronger regularization but can cause gradient vanishing in very deep networks.

Pre‑norm normalizes before the residual branch, which alleviates gradient vanishing and enables training of deeper models, though it may slightly reduce the effective depth.
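Written out for layer \(l\), with \(F_l\) denoting the attention or FFN sublayer and \(\mathrm{LN}\) denoting layer normalization (notation introduced here for illustration), the two variants are:

\[
\text{Post-norm: } x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big), \qquad
\text{Pre-norm: } x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big)
\]

In pre-norm the identity path from \(x_l\) to \(x_{l+1}\) bypasses normalization, which gives gradients a direct route through the stack; in post-norm every layer places an \(\mathrm{LN}\) on that path.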

10. Compute the FLOPs of a self‑attention module (hidden size \(D\), heads \(h\), head dim \(d\) with \(D = h\times d\), sequence length \(s\), batch size 1).

QKV linear projection: 6 × s × D² FLOPs.

QK inner product: h × 2 × d × s² FLOPs.

Scaling factor: h × s² FLOPs.

Softmax (exp, sum, division): h × 3 × s² FLOPs.

Weighted sum (attention × V): h × 2 × d × s² FLOPs.
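The terms above can be summed in a few lines. This is a sketch that mirrors the list exactly; like the list, it leaves out the output projection (another 2 × s × D² FLOPs). The function name is illustrative.

```python
def self_attention_flops(s, h, d):
    """FLOPs per self-attention module for batch size 1, following the terms listed above."""
    D = h * d  # hidden size
    terms = {
        "qkv_projection": 6 * s * D ** 2,
        "qk_inner_product": h * 2 * d * s ** 2,
        "scaling": h * s ** 2,
        "softmax": h * 3 * s ** 2,
        "weighted_sum": h * 2 * d * s ** 2,
    }
    terms["total"] = sum(terms.values())
    return terms


for name, value in self_attention_flops(s=1024, h=32, d=128).items():
    print(f"{name:>18}: {value:.3e}")
```

The two s² matmul terms together cost 4 × s² × D, so they overtake the 6 × s × D² projection term only once s exceeds 1.5 × D.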

11. Advantages and disadvantages of RoPE (rotary position encoding)

Advantages: achieves relative positioning by applying an absolute, position-dependent rotation to Q and K, without changing the attention formula; low computational cost; compatible with linear attention; attention scores tend to decay with relative distance, which favors nearby tokens.

Disadvantages: extrapolation beyond the training context length is weaker than with methods such as ALiBi; extending the context usually requires additional techniques (linear/NTK interpolation, YaRN).
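The relative-position property of RoPE can be demonstrated with complex arithmetic, treating each 2-D feature pair as one complex number. This is a minimal sketch assuming the common base of 10000; the function names and the sample vectors are illustrative.

```python
import cmath


def rope_rotate(vec, position, base=10000.0):
    """Rotate each complex feature pair by position * theta_i, theta_i = base^(-2i/d)."""
    half = len(vec)  # number of 2-D pairs, i.e. d / 2
    return [z * cmath.exp(1j * position * base ** (-i / half)) for i, z in enumerate(vec)]


def rope_score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    q_rot = rope_rotate(q, m)
    k_rot = rope_rotate(k, n)
    # Re(a * conj(b)) equals the real dot product of the two 2-D pairs.
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


q = [complex(0.3, -1.2), complex(0.8, 0.5)]
k = [complex(-0.4, 0.9), complex(1.1, 0.2)]

print(rope_score(q, k, 5, 2))      # offset 3
print(rope_score(q, k, 103, 100))  # offset 3 again, far from the origin
print(rope_score(q, k, 5, 3))      # offset 2
```

The first two scores agree up to floating-point error because only the offset m − n survives in the product, even though each vector was rotated by its absolute position.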

12. How does batchnorm momentum affect training?

During training the moving averages are updated as:

moving_mean = momentum × moving_mean + (1‑momentum) × batch_mean
moving_var  = momentum × moving_var  + (1‑momentum) × batch_var

A small momentum updates the statistics quickly but introduces higher variance; a large momentum updates slowly, which may leave the estimates inaccurate by the end of training, especially with small batch sizes.
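The effect of the momentum value is visible in a short simulation of the update rule above. A sketch follows, with illustrative function names; note that it uses the convention of the formula above, whereas PyTorch's `momentum` argument is the weight on the new batch statistic (the two conventions are complements of each other).

```python
def update_moving(moving, batch_stat, momentum):
    """One exponential-moving-average step, using the convention of the formula above."""
    return momentum * moving + (1 - momentum) * batch_stat


def track(batch_stats, momentum, initial=0.0):
    """Run the update over a sequence of per-batch statistics."""
    moving = initial
    for stat in batch_stats:
        moving = update_moving(moving, stat, momentum)
    return moving


# Ten batches whose true mean is 1.0, starting from an initial estimate of 0.0.
batches = [1.0] * 10
print(track(batches, momentum=0.9))
print(track(batches, momentum=0.99))
```

After ten batches the estimate has reached about 0.65 with momentum 0.9 but only about 0.10 with momentum 0.99, which illustrates how a large momentum can leave the statistics far from their true value when training is short.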

13. Why does multi‑head attention outperform single‑head attention?

Multiple heads attend to different sub-spaces simultaneously, capturing diverse linguistic patterns (syntax, semantics, local vs. long-range dependencies). This increases expressive power at little extra cost: the hidden size is split across the heads, so each head has low dimensionality, total compute stays close to that of a single full-width head, and the heads run in parallel.

14. Why does KV‑cache accelerate inference?

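In decoder-only models each new token attends only to itself and earlier tokens, and the K and V of earlier positions never change. By caching the projected K and V for earlier positions, the model avoids recomputing them at every generation step: each step projects only the new token and attends over the cache instead of re-running the whole prefix. The cache lives in GPU device memory (it is far too large for on-chip caches such as L2), so decoding with a KV cache is typically limited by memory bandwidth rather than by compute.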
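The saving can be seen in a toy single-head decoder loop. This is a simplified sketch in plain Python, assuming one attention head, no batching, and random weights; all function names are illustrative.

```python
import math
import random


def project(x, w):
    """Multiply a vector x (length D) by a D x D matrix w."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_without_cache(tokens, w_q, w_k, w_v):
    outputs = []
    for t in range(len(tokens)):
        # Re-projects K and V for the whole prefix at every step.
        keys = [project(x, w_k) for x in tokens[: t + 1]]
        values = [project(x, w_v) for x in tokens[: t + 1]]
        outputs.append(attend(project(tokens[t], w_q), keys, values))
    return outputs


def decode_with_cache(tokens, w_q, w_k, w_v):
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        # Projects K and V only for the new token and appends them to the cache.
        k_cache.append(project(x, w_k))
        v_cache.append(project(x, w_v))
        outputs.append(attend(project(x, w_q), k_cache, v_cache))
    return outputs


rng = random.Random(0)
dim, steps = 4, 6
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(steps)]
w_q, w_k, w_v = ([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)] for _ in range(3))
print(decode_with_cache(tokens, w_q, w_k, w_v)[-1])
```

Both functions produce the same outputs, but the uncached version performs a number of K/V projections that grows quadratically with the sequence length, while the cached version performs one pair per token.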

15. Pros and cons of ReLU

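Pros: cheap computation (max(0, x)); the gradient is exactly 1 for positive inputs, so the activation does not saturate and vanishing gradients are mitigated.

Cons: outputs are non-negative, so the mean activation is not zero, which shifts the input distribution of the next layer; neurons can die when their pre-activations stay negative, because the gradient there is zero and the weights stop updating.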

16. Why do Transformers use layernorm instead of batchnorm?

Text sequences have variable lengths and padding. Batchnorm would normalize across unrelated tokens in a batch, harming performance. Layernorm normalizes across the feature dimension of each token, making it robust to padding and sequence‑length variations. Empirical studies show batchnorm leads to unstable statistics in NLP tasks.

17. How do encoder and decoder interact in a Transformer?

Each decoder layer first performs masked self-attention over the decoder's own inputs. It then performs cross-attention, in which the decoder's current hidden states serve as the queries and the encoder's final hidden states serve as the keys and values. Every decoder layer attends to the same encoder output.

18. Difference between PyTorch view() and reshape()

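view() returns a tensor that shares storage with the original; the requested shape must be compatible with the tensor's existing strides (a contiguous tensor always qualifies), and view() raises an error otherwise. reshape() returns a view whenever that is possible and copies the data when it is not, for example on a non-contiguous tensor such as the result of a transpose.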

19. Models used in RLHF with PPO and their roles

Actor: initialized from the supervised-fine-tuned (SFT) model; trainable; generates responses (each generated token is an action).

Reference: frozen copy of the SFT model; used to compute the KL penalty that keeps the actor close to its starting point.

Reward: trained on human preference data before PPO; scores the actor's responses; frozen during PPO.

Critic: usually initialized from the reward model; trainable; predicts the expected return of the actor's actions.
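One common way these models interact is through a shaped per-token reward: a KL penalty computed from the actor and reference log-probabilities at every token, plus the reward-model score on the final token. The sketch below assumes that formulation; the function name, the coefficient `beta`, and the sample numbers are hypothetical and not taken from the original article.

```python
def shaped_rewards(actor_logprobs, ref_logprobs, reward_score, beta=0.1):
    """Per-token rewards: KL penalty on every token, reward-model score on the last one."""
    # Assumption: the KL term is approximated per token by log pi_actor - log pi_ref.
    rewards = [-beta * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


# Two generated tokens; the actor drifted from the reference on the first one.
print(shaped_rewards([-1.0, -2.0], [-1.5, -2.0], reward_score=1.0))
```

The critic is then trained to predict the discounted sum of these rewards, and the actor is updated with PPO using advantages derived from the critic's predictions.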

20. Main memory consumers during GPT‑style model training

Assume a model with \(L\) layers, vocabulary size \(V\), hidden size \(H\), batch size \(B\), sequence length \(S\), and \(N\) attention heads. Using mixed‑precision (half‑precision) Adam:

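Parameters: \(\Phi = VH + L(12H^2 + 13H)\) elements → \(2\Phi\) bytes in half precision.

Gradients: same size as parameters → \(2\Phi\) bytes.

Optimizer states: an fp32 master copy of the parameters plus Adam's first- and second-moment estimates, 4 bytes each → \(12\Phi\) bytes. Parameters, gradients, and optimizer states together take roughly \(16\Phi\) bytes.

Activations: roughly \(34BSH + 5BNS^2\) bytes per layer (the expression already counts bytes for 16-bit activations, so no extra factor of 2 applies), i.e. \(L(34BSH + 5BNS^2)\) bytes in total without recomputation.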

For GPT-3 (\(H=12288, L=96, N=96\)), the half-precision parameters occupy ~350 GB. With batch size \(B=1\), activations require ~90 GB at \(S=1024\) and exceed 3 TB at \(S=8192\).
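The GPT-3 figures can be reproduced with a short calculation. This sketch assumes the common mixed-precision Adam accounting (fp16 parameters and gradients, plus an fp32 master copy and two fp32 moments, 16Φ bytes in total), reads the activation expression as bytes per layer, and uses a vocabulary size of 50257, which the text above does not state.

```python
def param_count(V, H, L):
    """Phi = V*H + L*(12*H^2 + 13*H)."""
    return V * H + L * (12 * H ** 2 + 13 * H)


def model_state_bytes(phi):
    """Mixed-precision Adam: fp16 params and grads, fp32 master copy plus two moments."""
    return {"params_fp16": 2 * phi, "grads_fp16": 2 * phi, "optimizer_fp32": 12 * phi}


def activation_bytes(B, S, H, N, L):
    """L * (34*B*S*H + 5*B*N*S^2) bytes, without activation recomputation."""
    return L * (34 * B * S * H + 5 * B * N * S ** 2)


# GPT-3 shape from the text; V = 50257 is an assumption made for this example.
phi = param_count(V=50257, H=12288, L=96)
print(f"parameters: {phi / 1e9:.1f} B -> {2 * phi / 1e9:.0f} GB in fp16")
print(f"activations, S=1024: {activation_bytes(1, 1024, 12288, 96, 96) / 1e9:.0f} GB")
print(f"activations, S=8192: {activation_bytes(1, 8192, 12288, 96, 96) / 1e12:.2f} TB")
```

The quadratic \(S^2\) term is what makes activation memory grow so sharply between the two sequence lengths.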

21. Differences between bf16 and fp16 in half‑precision training

Both use 16 bits. fp16 has 1 sign, 5 exponent, 10 mantissa bits – higher precision but smaller dynamic range. bf16 has 1 sign, 8 exponent, 7 mantissa bits – larger range, lower precision. bf16 reduces overflow risk for large values; fp16 can yield slightly better accuracy for sensitive models.
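The range and precision trade-off follows directly from the bit layouts. The sketch below derives the largest finite value and the spacing just above 1.0 for an IEEE-style format; the function name is illustrative.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Largest finite value and machine epsilon for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The largest exponent is reserved for inf/NaN, so the top finite exponent equals the bias.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return max_finite, epsilon


for name, e_bits, m_bits in (("fp16", 5, 10), ("bf16", 8, 7)):
    max_finite, epsilon = float_format_stats(e_bits, m_bits)
    print(f"{name}: max = {max_finite:.4e}, epsilon = {epsilon:.2e}")
```

fp16 tops out at 65504, which is why loss scaling is commonly used with it, while bf16 reaches roughly the same range as fp32 at the cost of a coarser mantissa.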

22. Idea behind NTK‑aware interpolation for extending context length

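Linear (position) interpolation rescales position indices by the ratio of the training length to the target length, so that positions in the longer context fall between the positions seen during training; every RoPE frequency is compressed by the same factor. NTK-aware interpolation instead enlarges the RoPE base, so that high-frequency components are left almost unchanged (effectively extrapolated) while low-frequency components are interpolated, preserving the fine-grained positional resolution that uniform interpolation blurs.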
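The difference between the two schemes shows up in the per-dimension frequencies. The sketch below uses the widely used NTK-aware base adjustment, base × scale^(d / (d − 2)); the function names, the head dimension of 128, and the scale factor of 4 are choices made for this example.

```python
def rope_frequencies(dim, base=10000.0):
    """theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def linear_interpolated_frequencies(dim, scale, base=10000.0):
    """Dividing positions by `scale` is the same as dividing every frequency by it."""
    return [f / scale for f in rope_frequencies(dim, base)]


def ntk_aware_base(dim, scale, base=10000.0):
    """Enlarged base chosen so the lowest frequency is divided by exactly `scale`."""
    return base * scale ** (dim / (dim - 2))


dim, scale = 128, 4
original = rope_frequencies(dim)
linear = linear_interpolated_frequencies(dim, scale)
ntk = rope_frequencies(dim, ntk_aware_base(dim, scale))

print("highest frequency:", original[0], linear[0], ntk[0])
print("lowest frequency ratio (ntk / original):", ntk[-1] / original[-1])
```

Linear interpolation divides even the highest frequency by 4, whereas the NTK-aware base leaves it at 1.0 and applies the full factor of 4 only to the lowest frequency.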

23. How NTK‑by‑parts improves on NTK‑aware interpolation

NTK‑by‑parts treats RoPE components differently based on their wavelength relative to the context length:

Very short wavelengths (≤ 1/32 of context) are only extrapolated.

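Very long wavelengths (≥ context length) are only interpolated, because they never complete a full rotation within the training context and extrapolating them would produce angles the model has not seen.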

Intermediate wavelengths receive a weighted combination of extrapolation and interpolation, defined by a piecewise ramp function.
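The three cases can be expressed with a single ramp over the number of rotations a component completes within the training context. This is a sketch assuming thresholds of 1 and 32 rotations, matching the wavelength limits above; the function and parameter names are illustrative.

```python
import math


def ramp(r, alpha=1.0, beta=32.0):
    """0 below alpha, 1 above beta, linear in between."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)


def ntk_by_parts_frequency(theta, scale, train_len, alpha=1.0, beta=32.0):
    """Blend the interpolated frequency (theta / scale) with the original one."""
    wavelength = 2 * math.pi / theta
    rotations = train_len / wavelength   # full turns inside the training context
    gamma = ramp(rotations, alpha, beta)
    return (1 - gamma) * theta / scale + gamma * theta


for theta in (1.0, 0.01, 1e-4):
    print(theta, ntk_by_parts_frequency(theta, scale=4, train_len=4096))
```

With a training length of 4096 and a scale of 4, the frequency 1.0 is kept as is, 1e-4 is divided by 4, and 0.01 lands between the two extremes.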

24. How YaRN extends context length

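YaRN combines NTK-by-parts with a temperature \(t\) applied to the attention logits, computing \(\mathrm{softmax}\big(q^\top k / (t\sqrt{d})\big)\) with \(t<1\) for scale factors above 1, which sharpens the attention distribution over the longer context. Because RoPE is a linear rotation, the same effect is obtained by scaling the rotated queries and keys (in practice, the cos/sin tables) by \(\sqrt{1/t}\), so the attention code itself does not change.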
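For the temperature, the YaRN paper reports an empirical fit for LLaMA-family models, \(\sqrt{1/t} = 0.1\ln(s) + 1\), where \(s\) is the context scale factor and the logits are divided by \(t\). The sketch below assumes that fit and that convention; the function names are illustrative.

```python
import math


def yarn_qk_scale(scale):
    """sqrt(1/t): factor applied to both the rotated queries and the rotated keys."""
    if scale <= 1:
        return 1.0  # no context extension, no temperature change
    return 0.1 * math.log(scale) + 1.0


def yarn_temperature(scale):
    """Temperature t such that logits are divided by t."""
    return 1.0 / yarn_qk_scale(scale) ** 2


for s in (1, 4, 16):
    print(s, round(yarn_qk_scale(s), 4), round(yarn_temperature(s), 4))
```

Scaling both q and k by the same factor multiplies every logit by the square of that factor, which is exactly the division by \(t\).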

25. KV-cache memory requirement for a model using Grouped-Query Attention

For hidden size \(D\), \(h\) heads, head dimension \(d = D/h\), \(n\) KV groups, sequence length \(s\), batch size \(b\), and \(L\) layers, the cache stores \(2bLnsD/h\) values: one K and one V vector of dimension \(d\) per KV group, position, layer, and sequence. Using half-precision (2 bytes per value) the total memory is \(4bLnsD/h\) bytes.
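The formula translates into a one-line calculation. The sketch below compares GQA with full MHA for a hypothetical configuration (80 layers, 64 query heads, head dimension 128, 8 KV groups, 4096 tokens); these numbers are illustrative and not taken from the article.

```python
def kv_cache_bytes(batch, layers, kv_groups, seq_len, head_dim, bytes_per_value=2):
    """2 * b * L * n * s * d values (K and V), times the bytes per value."""
    return 2 * batch * layers * kv_groups * seq_len * head_dim * bytes_per_value


# Hypothetical model shape, chosen only to illustrate the ratio.
gqa = kv_cache_bytes(batch=1, layers=80, kv_groups=8, seq_len=4096, head_dim=128)
mha = kv_cache_bytes(batch=1, layers=80, kv_groups=64, seq_len=4096, head_dim=128)
print(f"GQA: {gqa / 2**30:.2f} GiB, MHA: {mha / 2**30:.2f} GiB")
```

The cache scales linearly with the number of KV groups, so going from 64 KV heads to 8 groups cuts it by a factor of 8 at the same sequence length.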
