Dynamic Tanh Lets Kaiming He and Yann LeCun Drop Transformer Normalization in 9 Lines
Researchers Kaiming He, Yann LeCun and colleagues propose a 9‑line Dynamic Tanh (DyT) layer that replaces LayerNorm/RMSNorm in Transformers, showing comparable or superior accuracy across vision, language, speech and DNA tasks while also reducing inference latency on modern GPUs.
Observation of LayerNorm behavior
LayerNorm in Transformers maps inputs to outputs with an S‑shaped curve that closely matches a scaled tanh function. The mapping is linear around zero (≈99 % of points) but compresses extreme values, providing a bounded output while preserving gradients for the bulk of the distribution. This empirical observation (arXiv:2503.10622) motivates a direct replacement of the normalization step.
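This correspondence is easy to check numerically. The sketch below (illustrative values, not taken from the paper) standardizes a heavy‑tailed activation vector and compares it with a scaled tanh of the raw input; the two mappings track each other closely in the linear region and both squash the tails:

```python
import torch

torch.manual_seed(0)
# Heavy-tailed "activations": per-element scales mimic the spread
# seen across tokens inside a trained Transformer (illustrative only)
x = torch.randn(10_000) * (torch.rand(10_000) * 3)

normalized = (x - x.mean()) / x.std()   # what a norm layer would emit
tanh_fit = torch.tanh(0.7 * x)          # 0.7 is an arbitrary demo alpha

# Strong correlation: near-linear around zero, compressed in the tails
corr = torch.corrcoef(torch.stack([normalized, tanh_fit]))[0, 1]
print(corr.item())
```

The demo collapses the per‑token statistics of real LayerNorm into a single standardization, so it only illustrates the shape of the mapping, not the paper's exact measurement.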
Dynamic Tanh (DyT) definition
DyT introduces a learnable scalar α that scales the input before a tanh non‑linearity, followed by a per‑feature affine transformation (γ, β). The operation requires no running statistics.
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(num_features))   # γ
        self.bias = nn.Parameter(torch.zeros(num_features))    # β

    def forward(self, x):
        x = torch.tanh(self.alpha * x)
        return x * self.weight + self.bias

Experimental setup
DyT replaces LayerNorm or RMSNorm in a broad suite of Transformer‑based models without changing any hyper‑parameters:
Vision: ViT‑Base/Large, ConvNeXt‑Base/Large, MAE, DINO, DiT (B/L/XL)
Large Language Models: LLaMA‑7B, 13B, 34B, 70B (RMSNorm → DyT)
Speech: wav2vec 2.0 (base & large)
DNA sequence modeling: HyenaDNA, Caduceus
All training follows the official recipes of the original papers, ensuring a fair plug‑and‑play comparison.
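As a plug‑and‑play illustration (a sketch, not the official conversion script from the repo), the snippet below recursively swaps every nn.LayerNorm in a PyTorch model for a width‑matched DyT; the `replace_layernorm_with_dyt` helper is our own:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: tanh(alpha * x) followed by a per-feature affine."""
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        return torch.tanh(self.alpha * x) * self.weight + self.bias

def replace_layernorm_with_dyt(module):
    """Recursively replace nn.LayerNorm children with width-matched DyT."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1]))
        else:
            replace_layernorm_with_dyt(child)
    return module

# Demo on a stock Transformer block: shapes are unchanged after the swap
model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
replace_layernorm_with_dyt(model)
out = model(torch.randn(2, 10, 64))
```

Because DyT keeps the per‑feature affine of the layer it replaces, no other hyper‑parameter needs to change, which is what makes the paper's like‑for‑like comparison possible.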
Results
Across every benchmark DyT matches or exceeds the original normalization layer:
Vision (ImageNet‑1K) – Top‑1 accuracy improves by 0.1–0.3 % for ViT‑Base/Large and ConvNeXt‑Base/Large.
Self‑supervised vision (MAE, DINO) – Validation loss and downstream linear probing are on par with LayerNorm.
Diffusion Transformers (DiT) – Fréchet Inception Distance (FID) is equal or lower than the LayerNorm baseline.
LLM pre‑training (LLaMA) – Per‑token loss curves for 7B‑70B are indistinguishable from RMSNorm; final perplexities differ by <0.1 %.
Speech (wav2vec 2.0) – Validation loss remains within 0.02 of the LayerNorm baseline.
DNA (HyenaDNA, Caduceus) – GenomicBenchmarks scores are unchanged.
Analysis of the learnable α parameter
During training α tracks the inverse standard deviation of activations (1/σ). Empirically:
For non‑LLM models performance is robust to α₀ in the range 0.5–1.2.
Larger models (more parameters) benefit from smaller α₀; wider networks need lower α₀ than deeper ones.
In attention blocks a higher α₀ reduces the final loss, while a lower α₀ in feed‑forward blocks stabilizes training.
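The 1/σ‑tracking behavior can be illustrated with a toy fit: train a single α so that tanh(α·x) approximates the standardized input. This is a much cruder objective than real network pre‑training, so treat the result as qualitative only:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096) * 3.0              # activations with std ~ 3
target = (x - x.mean()) / x.std()        # LayerNorm-style output

# Fit a single scalar alpha by gradient descent on an MSE objective
alpha = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.Adam([alpha], lr=0.01)
for _ in range(2000):
    opt.zero_grad()
    loss = ((torch.tanh(alpha * x) - target) ** 2).mean()
    loss.backward()
    opt.step()

# alpha drifts toward the order of 1/std(x) ~ 0.33, though this toy
# objective is not identical to real pre-training dynamics
print(alpha.item(), 1.0 / x.std().item())
```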
Computation efficiency
Benchmark on an Nvidia H100 (BF16) with 4096‑token sequences:
DyT forward pass is ~30 % faster than RMSNorm.
Full training step (forward + backward) shows a similar speedup; the trend holds for FP32.
Thus DyT reduces both memory (no running mean/var) and latency.
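A rough way to reproduce the latency comparison locally is sketched below; absolute and relative numbers depend heavily on hardware, tensor size and dtype, so the paper's H100/BF16 figures will not reproduce on, say, a CPU. The minimal RMSNorm here is our own comparison baseline:

```python
import time
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm baseline (scale by reciprocal RMS, no mean)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class DyT(nn.Module):
    def __init__(self, dim, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return torch.tanh(self.alpha * x) * self.weight + self.bias

def bench(layer, x, iters=20):
    """Wall-clock time for `iters` forward passes (after one warm-up)."""
    with torch.no_grad():
        layer(x)                      # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            layer(x)
    return time.perf_counter() - t0

x = torch.randn(8, 512, 768)
t_rms = bench(RMSNorm(768), x)
t_dyt = bench(DyT(768), x)
print(f"RMSNorm: {t_rms:.4f}s  DyT: {t_dyt:.4f}s")
```

On GPU, add torch.cuda.synchronize() around the timed region; without it the timings measure kernel launch, not execution.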
Additional ablations
Learning‑rate tuning and α‑initialization were examined on all non‑LLM tasks:
Using the original learning‑rate yields almost identical performance; aggressive tuning offers marginal gains (<0.2 %).
Default α₀ = 0.5 is near‑optimal for most models; only ViT‑Large shows instability when α₀ > 0.6, which can be mitigated by lowering the learning‑rate.
For LLMs, a systematic sweep over α₀ on a 30B‑token pre‑training run reveals:
Optimal α₀ decreases with model size (7B → 0.45, 70B → 0.30).
Higher α₀ in attention blocks and lower α₀ elsewhere yields the lowest loss.
Model width, not depth, drives the optimal α₀ (wider models need smaller values).
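If one wanted to encode the sweep's trends in code, a hypothetical helper might look like the following. Both the linear interpolation and the attention multiplier are our own illustrative inventions, anchored only to the two sweep points and the attention‑vs‑elsewhere direction quoted above:

```python
def suggest_alpha_init(block_type: str, model_size_b: float) -> float:
    """Hypothetical heuristic for picking alpha_init_value.

    Interpolates between the quoted sweep points (7B -> 0.45,
    70B -> 0.30) and nudges attention blocks higher, mirroring the
    'higher alpha0 in attention' finding. The exact schedule is NOT
    specified in the summary above.
    """
    base = 0.45 - 0.15 * (model_size_b - 7.0) / (70.0 - 7.0)
    if block_type == "attention":
        base *= 1.25  # arbitrary bump; only the direction is supported
    return base

print(suggest_alpha_init("ffn", 7.0))    # -> 0.45
print(suggest_alpha_init("ffn", 70.0))
```

A width‑based rather than parameter‑count‑based schedule would match the sweep's third finding, but the document gives no numbers to calibrate one.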
Conclusions
Dynamic Tanh provides a 9‑line, statistics‑free replacement for LayerNorm/RMSNorm that preserves or improves accuracy across vision, language, speech, and genomics while offering measurable speedups on modern GPUs. The learned scaling factor α implicitly performs a normalization‑like adjustment, explaining why the method can match traditional norm layers without explicit mean/variance computation.
Project code and pretrained checkpoints are available at https://github.com/jiachenzhu/DyT
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.