How QK-Clip Tames MaxLogit Explosions in Trillion‑Parameter LLMs
The article introduces QK-Clip, a lightweight per‑head weight‑clipping technique that uses the MaxLogit signal to prevent uncontrolled logit growth in massive LLMs, explains its design, compares it with prior methods, and shows that it stabilizes training without harming model performance.
Problem Definition
During training of extremely large language models, the maximum attention logit (MaxLogit) before the softmax can grow linearly or even super‑linearly, eventually exploding. MaxLogit is the largest entry of the pre‑softmax attention score matrix, maximized over the batch dimension as well, and serves as an outlier indicator. When it grows without bound, it can cause gradient spikes or outright training crashes. This is especially acute at very large scale: RMSNorm bounds the activations feeding into attention, but it does not constrain the spectral norm of the Q/K weight matrices, so the logits themselves remain uncontrolled.
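As a concrete illustration, here is a minimal PyTorch sketch of the monitored quantity: the per‑head maximum of the scaled q·k scores, reduced over the batch and both sequence axes. Names and shapes are illustrative, not the production implementation.

```python
import torch

def max_logit_per_head(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Per-head MaxLogit: the largest pre-softmax attention score,
    maximized over batch, query, and key positions.

    q, k: [batch, heads, seq, head_dim]
    returns: [heads]
    """
    d = q.shape[-1]
    # Pre-softmax logits: [batch, heads, seq_q, seq_k]
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / d ** 0.5
    # Reduce over every axis except the head axis.
    return logits.amax(dim=(0, 2, 3))

# Example: monitor during training and log heads approaching a threshold.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
print(max_logit_per_head(q, k))
```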
Prior Mitigations
Weight decay can partially suppress MaxLogit, but for very large models the effect is limited and may hurt performance.
Directly capping the logits (e.g., Gemma2's tanh soft‑capping) guarantees that whatever enters the softmax is bounded, but it does not stop the underlying q·k products, or the norms of the Q/K weights, from growing; it merely hides the explosion upstream of the cap.
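For reference, soft‑capping amounts to a single tanh squashing of the scores; a minimal sketch (the default cap of 50 follows Gemma2's attention setting):

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Every capped score lies in (-cap, cap), but the raw q.k values
    # underneath -- and the norms of W_q, W_k -- can keep growing.
    return cap * torch.tanh(scores / cap)
```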
QK‑Norm effectively limits MaxLogit for multi‑head attention (MHA) and GQA, but it must normalize each query/key vector after it is materialized. MLA (Multi‑head Latent Attention) never materializes per‑head K during decoding; it attends through the shared compressed latent, so there is nothing for QK‑Norm to normalize and the technique cannot be applied.
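A minimal sketch of QK‑Norm without the usual learnable gains; the catch for MLA is that the per‑head k vectors normalized here are exactly what its decoding path avoids materializing:

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()

def qk_norm_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: [batch, heads, seq, head_dim]. After RMS-normalization each
    # vector has L2 norm ~sqrt(d), so every logit is bounded by ~sqrt(d)
    # no matter how large W_q and W_k grow.
    q, k = rms_norm(q), rms_norm(k)
    return torch.einsum("bhqd,bhkd->bhqk", q, k) / q.shape[-1] ** 0.5
```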
QK‑Clip Method
QK‑Clip treats MaxLogit itself as a trigger. After each optimizer update, the current MaxLogit m of every attention head is measured. If m exceeds a predefined threshold T, a scaling factor γ = T / m (which is < 1 for any offending head) is computed and applied to that head's Q and K projection weights, e.g., √γ to each side so their product shrinks by exactly γ. This pulls the head's MaxLogit back to T while adding no operations to the model's forward or inference graph, unlike QK‑Norm.
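A minimal sketch of the core update for standard multi‑head attention, assuming the per‑head projection weights are laid out as [heads, head_dim, hidden] (the layout and names are assumptions, not the production code; tau plays the role of the threshold T):

```python
import torch

@torch.no_grad()
def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logits: torch.Tensor, tau: float = 100.0) -> None:
    """Apply per-head QK-Clip after an optimizer step.

    w_q, w_k:   [heads, head_dim, hidden] per-head projection weights
    max_logits: [heads] MaxLogit observed for each head this step
    """
    gamma = (tau / max_logits).clamp(max=1.0)   # < 1 only for offending heads
    scale = gamma.sqrt().view(-1, 1, 1)         # split gamma across Q and K
    w_q.mul_(scale)                             # sqrt(gamma) on the query side
    w_k.mul_(scale)                             # sqrt(gamma) on the key side
```

Scaling both sides by √γ shrinks every logit of the offending head by exactly γ = T/m, bringing its MaxLogit down to the threshold.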
For MLA, each head's logit decomposes into a content term (qc·kc) and a RoPE term (qr·kr), giving four projection sub‑matrices (qc, kc, qr, kr). The qc and kc projections are per‑head, but the RoPE key kr is shared across all heads, so scaling it would inflict collateral damage on heads that never exceeded the threshold. On the RoPE side the scaling factor is therefore applied to qr alone, leaving kr untouched.
Fine‑Tuning Details
Per‑head monitoring: instead of a single global MaxLogit, each head’s MaxLogit is tracked independently. Only heads that exceed T are clipped, preventing “over‑clipping” of stable heads.
Separate handling of MLA sub‑matrices: on the content side, qc and kc each take √γ; on the RoPE side, qr absorbs the full factor γ, and the shared kr is never scaled (see the sketch after this list).
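Putting the two details together, a sketch of the MLA variant; the per‑head tensors and 3‑D shapes are assumptions, the point is which factor lands on which sub‑matrix:

```python
import torch

@torch.no_grad()
def qk_clip_mla_(w_qc: torch.Tensor, w_kc: torch.Tensor,
                 w_qr: torch.Tensor, max_logits: torch.Tensor,
                 tau: float = 100.0) -> None:
    # Leading dim of each weight is heads; gamma broadcasts over the rest.
    gamma = (tau / max_logits).clamp(max=1.0).view(-1, 1, 1)
    w_qc.mul_(gamma.sqrt())   # content query: sqrt(gamma)
    w_kc.mul_(gamma.sqrt())   # content key:   sqrt(gamma) (per-head, safe)
    w_qr.mul_(gamma)          # RoPE query: full gamma, because the shared
                              # RoPE key w_kr must stay untouched
```

Both logit components, qc·kc and qr·kr, then shrink by the same factor γ.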
Experimental Results
In the trillion‑parameter model Kimi K2, a threshold T = 100 was used. MaxLogit spikes appeared around step 7k; with QK‑Clip combined with the Muon optimizer (the pairing Kimi K2 calls MuonClip) the values stayed bounded at the threshold. After roughly 70k steps every head's MaxLogit had fallen below T on its own, and QK‑Clip became a no‑op. Smaller‑scale experiments showed that even aggressive clipping (e.g., forcing MaxLogit down to 30) did not degrade downstream performance, supporting the claim that the method is effectively lossless.
The technique also works with Adam (referred to as “AdamClip”) and can be combined with other stability tricks.
Theoretical Insight
The Muon optimizer's msign update is full‑rank: all of its singular values are equal, so every step injects energy in every direction, including the top singular directions of the Q/K weights, which inflates their spectral norm. Adam's updates, by contrast, tend to have low effective rank. This makes MaxLogit explosions more likely under Muon, and indeed Muon‑trained weights exhibit higher singular‑value entropy (i.e., higher effective rank), which correlates with the observed instability.
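The gap is easy to see numerically. A toy comparison, using singular‑value entropy as the effective‑rank proxy (msign here denotes Muon's matrix‑sign update, msign(G) = U Vᵀ from the SVD of G):

```python
import torch

def sv_entropy(w: torch.Tensor) -> torch.Tensor:
    """Entropy of the normalized singular values; exp(entropy) is a
    common definition of effective rank."""
    s = torch.linalg.svdvals(w)
    p = s / s.sum()
    return -(p * (p + 1e-12).log()).sum()

g = torch.randn(512, 512)                       # stand-in gradient
u, _, vh = torch.linalg.svd(g, full_matrices=False)
print(sv_entropy(u @ vh))  # msign(G): all singular values equal -> log(512) ~ 6.24
print(sv_entropy(g))       # Gaussian matrix: lower entropy, lower effective rank
```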
Implementation Considerations
Per‑head clipping requires access to the full Q/K weight matrices after each optimizer step. In distributed training the matrices are often sharded, so an additional synchronization step is needed to compute per‑head MaxLogit and apply the scaling factor.
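A sketch of that synchronization with torch.distributed, assuming each rank holds per‑head MaxLogit for its local batch shard, so a MAX all‑reduce makes the clipping factors identical everywhere (with tensor parallelism the reduce would run over the data‑parallel group for the locally owned heads):

```python
import torch
import torch.distributed as dist

def synced_max_logits(local_max: torch.Tensor) -> torch.Tensor:
    """local_max: [heads], MaxLogit over this rank's local batch.
    After the all-reduce every rank holds the global per-head MaxLogit
    and therefore applies the same scaling factor to its weight shard."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    return local_max
```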
References
[1] Moonlight: https://kexue.fm/archives/10739
[2] Muon optimizer: https://kexue.fm/archives/10592
[3] Kimi K2: https://moonshotai.github.io/Kimi-K2/
[4] Muon sequel analysis: https://kexue.fm/archives/10739
[5] Gemma2: https://arxiv.org/abs/2408.00118
[6] Gemma3: https://arxiv.org/abs/2503.19786
[7] MLA part 1: https://kexue.fm/archives/10907
[8] MLA part 2: https://kexue.fm/archives/11111
[9] Singular‑value clipping: https://kexue.fm/archives/11006
[10] DeepSeek‑V3: https://arxiv.org/abs/2412.19437
[11] Effective rank: https://kexue.fm/archives/10847
[12] Higher‑order muP: https://kexue.fm/archives/10795
[13] Loss‑Free balancing: https://kexue.fm/archives/10757