Why the First Token Becomes a Value Garbage Bin – LeCun Team Dissects Spike and Attention Sink Mechanics

The paper by Yann LeCun’s team reveals that massive activation spikes and attention sinks in Transformers are not inherently coupled: spikes arise from position-0 token processing and specific feed-forward dynamics, while attention sinks emerge from Pre-norm placement of normalization and from attention-head dimension, offering practical insights for model quantization and long-context inference.

In Transformer architectures, two long‑standing internal phenomena—Massive Activations (spikes) and Attention Sinks—have often been assumed to be tightly coupled. The authors of The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks (arXiv:2603.05498) use mechanism‑level interpretability to systematically disentangle these effects.

Spike Phenomenon

Spikes appear as extreme outliers in a few hidden channels for a small subset of tokens. Experiments show that with sub-optimal hyper-parameters, for example disabling weight decay (weight decay = 0.0), spike magnitude can explode to 12,275 while perplexity remains unchanged and the Sink Ratio stays at ~33.8 %.
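
A minimal sketch of how such spikes can be detected, assuming the thresholding rule used in prior massive-activation work (an entry counts as a spike when its magnitude far exceeds the median absolute activation); the paper may operationalize spikes differently:

```python
import torch

def find_spikes(hidden: torch.Tensor, ratio: float = 1000.0) -> torch.Tensor:
    # hidden: (seq_len, d_model) hidden states from one layer.
    # Flag an entry as a spike when its magnitude exceeds `ratio`
    # times the median absolute activation over the whole tensor.
    mags = hidden.abs()
    threshold = ratio * mags.median()
    return (mags > threshold).nonzero()  # (token, channel) index pairs
```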

Statistical analysis across Llama and Qwen families demonstrates that >98 % of spikes are triggered when a token occupies position 0 in the sequence, indicating a positional rather than semantic cause. The first token attends only to itself, reducing the attention block to a static linear map that pushes the hidden state toward a high‑gain direction.
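
A toy PyTorch illustration of why the attention block degenerates at position 0 (single head, random matrices, no RoPE or biases; purely for demonstration):

```python
import torch

torch.manual_seed(0)
d = 16
W_v = torch.randn(d, d) / d**0.5   # value projection
W_o = torch.randn(d, d) / d**0.5   # output projection
x0 = torch.randn(d)                # hidden state of the position-0 token

# Under causal masking, position 0 can attend only to itself, so the
# softmax is over a single logit and equals exactly 1. The attention
# output is then the fixed linear map W_o @ W_v applied to x0: no
# data-dependent mixing, just a static push along that map's
# high-gain directions.
attn_out = W_o @ (W_v @ x0)
assert torch.allclose(attn_out, (W_o @ W_v) @ x0, atol=1e-6)
```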

The amplification originates in the step-up (up-projection) block of the SwiGLU-based feed-forward network, where the SiLU activation operates near its identity region. In this near-linear regime, the i-th output coordinate can be approximated by a quadratic form in the input, and a single dominant eigenvalue of the associated matrix governs the magnitude of spike-active channels (see Fig 2, Table 1).
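
A sketch of that approximation, using the standard SwiGLU parameterization (notation ours; the paper's may differ). The feed-forward block computes

$$\mathrm{FFN}(x) = W_{\mathrm{down}}\,\big(\mathrm{SiLU}(W_{\mathrm{gate}}\,x)\odot(W_{\mathrm{up}}\,x)\big),$$

and when SiLU operates near identity, $\mathrm{SiLU}(z)\approx z$, so the $i$-th intermediate coordinate becomes

$$h_i \approx (w_{\mathrm{gate},i}^{\top}x)\,(w_{\mathrm{up},i}^{\top}x) = x^{\top}\big(w_{\mathrm{gate},i}\,w_{\mathrm{up},i}^{\top}\big)\,x,$$

a quadratic form in $x$. Its symmetrized matrix $\tfrac{1}{2}(w_{\mathrm{gate},i}w_{\mathrm{up},i}^{\top} + w_{\mathrm{up},i}w_{\mathrm{gate},i}^{\top})$ has rank at most two, so a single dominant eigenvalue dictates how fast $h_i$ grows when $x$ aligns with the corresponding eigenvector.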

Attention Sink Phenomenon

Attention Sinks arise when a subset of tokens (Sink Tokens) attract a disproportionate amount of attention weight. After RMSNorm (Pre‑norm) is applied, spike values are bounded and non‑spike channels are heavily suppressed, yielding a sparse multi‑hot vector. This vector, when projected through the key matrix, collapses into a low‑dimensional subspace, causing Sink Tokens to dominate the attention distribution.
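
A minimal numerical illustration of that bounding effect, using plain RMSNorm without the learned gain (channel index and magnitudes arbitrary):

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / torch.sqrt(x.pow(2).mean() + eps)

d = 4096
x = 0.1 * torch.randn(d)
x[371] = 2500.0  # one massive-activation channel

y = rms_norm(x)
print(y[371].item())            # ~64, i.e. about sqrt(d): the spike is clamped
print(y.abs().median().item())  # ~0.002: every other channel is squashed
# No matter how large the spike grows, its normalized value stays near
# sqrt(d), while the rest of the vector collapses toward zero: the
# "sparse multi-hot" shape described above.
```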

Sink Ratio serves as a proxy for model optimization health. Ablation of normalization (replacing RMSNorm with Sandwich Norm or DynamicTanh) suppresses spikes completely while preserving a high Sink Ratio, confirming that spikes are not a prerequisite for sinks.
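
Since Sink Ratio is used throughout as a measurement, here is one plausible way to compute it; the threshold and exact definition are our assumptions and may differ from the paper's:

```python
import torch

def sink_ratio(attn: torch.Tensor, eps: float = 0.3) -> float:
    # attn: attention weights of shape (n_heads, seq_len, seq_len),
    # with each row summing to 1. Count the fraction of (head, query)
    # pairs that place more than `eps` of their attention mass on the
    # position-0 token.
    mass_on_first = attn[:, :, 0]  # (n_heads, seq_len)
    return (mass_on_first > eps).float().mean().item()
```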

Increasing the attention-head dimension from 8 to 128 monotonically raises the Sink Ratio (Table 3), showing that larger heads provide geometric separation between Sink and non-Sink keys. t-SNE visualizations (Fig 4) illustrate that queries in Sink heads cluster near the fixed Sink key, widening the logit gap and routing attention toward the sink.

Further Ablations

Dynamic conditional gating of attention (Per‑Channel mode) reduces Sink Ratio to 4.5 %, demonstrating that Sink behavior is a learned routing strategy that can be disabled with explicit gating.
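
A sketch of what such a gate could look like; the placement and parameterization here are our assumptions, not the paper's exact "Per-Channel" design:

```python
import torch
import torch.nn as nn

class PerChannelAttnGate(nn.Module):
    # An input-conditioned sigmoid gate applied per output channel of
    # the attention block. Given an explicit way to zero channels out,
    # the model no longer needs to park attention mass on a sink token.
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out, x: (batch, seq_len, d_model); x is the block input.
        return torch.sigmoid(self.gate_proj(x)) * attn_out
```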

Training on long sequences only (e.g., 2048/4096 tokens) causes Sink Ratio to collapse to ~1.2 % (Table 6), indicating that sinks primarily serve short‑range dependency handling in global attention.

Conclusions

The study overturns the belief that massive activations and attention sinks are inseparable. Spikes are a by‑product of position‑0 token processing and the step‑up block’s near‑identity SiLU activation, while sinks are an architectural response to the need for efficient routing of short‑range information under Pre‑norm normalization. These insights enable targeted architectural tweaks—such as normalization changes or gating—to suppress undesirable spikes without harming language modeling performance, benefiting quantization and long‑context inference for large language models.

Model Optimization · LLM · Transformer · Attention Sink · Massive Activations · Pre-norm
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
