Dynamic Tanh Lets Kaiming He and LeCun Drop Transformer Normalization in 9 Lines

Researchers Kaiming He, Yann LeCun, and colleagues propose a 9‑line Dynamic Tanh (DyT) layer that replaces LayerNorm/RMSNorm in Transformers, showing comparable or superior accuracy across vision, language, speech and DNA tasks while also reducing inference latency on modern GPUs.


Observation of LayerNorm behavior

LayerNorm in Transformers maps inputs to outputs along an S‑shaped curve that closely matches a scaled tanh function. The mapping is approximately linear around zero, where roughly 99 % of the values lie, but compresses extreme values, yielding a bounded output while preserving gradients for the bulk of the distribution. This empirical observation (arXiv:2503.10622) motivates a direct replacement of the normalization step.
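The effect can be visualized with a toy sketch (not the paper's measurement, which uses activations from trained ViT and wav2vec models): feed a LayerNorm tokens whose scales vary and scatter each element's input against its output; the per‑token normalization squashes extreme elements into a tanh‑like envelope.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Toy illustration of the S-shaped, tanh-like element-wise mapping of LayerNorm.
torch.manual_seed(0)
ln = nn.LayerNorm(256)
# 1024 "tokens" of width 256 with widely varying per-token scales
x = torch.randn(1024, 256) * (0.5 + 5.0 * torch.rand(1024, 1))
with torch.no_grad():
    y = ln(x)
plt.scatter(x.flatten(), y.flatten(), s=1, alpha=0.05)
plt.xlabel("input to LayerNorm")
plt.ylabel("output of LayerNorm")
plt.savefig("layernorm_tanh_like.png")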

Dynamic Tanh (DyT) definition

DyT introduces a learnable scalar α that scales the input before a tanh non‑linearity, followed by a per‑feature affine transformation (γ, β): DyT(x) = γ ⊙ tanh(α·x) + β. Unlike LayerNorm, the operation computes no activation statistics at all.

import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)  # scalar α
        self.weight = nn.Parameter(torch.ones(num_features))   # γ
        self.bias = nn.Parameter(torch.zeros(num_features))    # β

    def forward(self, x):
        x = torch.tanh(self.alpha * x)
        return x * self.weight + self.bias
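As a usage sketch (a hypothetical minimal block, not any of the paper's architectures), DyT slots in wherever nn.LayerNorm would sit in a pre‑norm Transformer block:

class Block(nn.Module):
    # Minimal pre-norm Transformer block with DyT in place of nn.LayerNorm.
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = DyT(dim)          # was: nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = DyT(dim)          # was: nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))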

Experimental setup

DyT replaces LayerNorm or RMSNorm in a broad suite of Transformer‑based models without changing any hyper‑parameters:

Vision: ViT‑Base/Large, ConvNeXt‑Base/Large, MAE, DINO, DiT (B/L/XL)

Large Language Models: LLaMA‑7B, 13B, 34B, 70B (RMSNorm → DyT)

Speech: wav2vec 2.0 (base & large)

DNA sequence modeling: HyenaDNA, Caduceus

All training follows the official recipes of the original papers, ensuring a fair plug‑and‑play comparison; a minimal conversion sketch is shown below.
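What such a plug‑and‑play swap can look like, as a minimal sketch (the official repository ships its own conversion code; this recursive walk assumes the single‑dimension LayerNorms used in these Transformers):

def convert_layernorm_to_dyt(module):
    # Recursively replace every nn.LayerNorm with a DyT layer of matching width.
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1]))
        else:
            convert_layernorm_to_dyt(child)
    return module

After the swap, training proceeds with the original recipe and hyper‑parameters.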

Results

Across every benchmark, DyT matches or exceeds the original normalization layer:

Vision (ImageNet‑1K) – Top‑1 accuracy improves by 0.1–0.3 % for ViT‑Base/Large and ConvNeXt‑Base/Large.

Self‑supervised vision (MAE, DINO) – Validation loss and downstream linear probing are on par with LayerNorm.

Diffusion Transformers (DiT) – Fréchet Inception Distance (FID) is equal or lower than the LayerNorm baseline.

LLM pre‑training (LLaMA) – Per‑token loss curves for 7B‑70B are indistinguishable from RMSNorm; final perplexities differ by <0.1 %.

Speech (wav2vec 2.0) – Validation loss remains within 0.02 of the LayerNorm baseline.

DNA (HyenaDNA, Caduceus) – GenomicBenchmarks scores are unchanged.

Analysis of the learnable α parameter

During training α tracks the inverse standard deviation of the activations (1/σ); a minimal probe sketch for reproducing this is shown after the list. Empirically:

For non‑LLM models performance is robust to α₀ in the range 0.5–1.2.

Larger models (more parameters) benefit from smaller α₀; wider networks need lower α₀ than deeper ones.

In attention blocks a higher α₀ improves loss, while a lower α₀ in feed‑forward blocks stabilizes training.
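A probe sketch, assuming the DyT class above and an ordinary PyTorch training loop, that logs each DyT layer's α alongside 1/σ of its incoming activations (the pair compared in the paper):

def attach_alpha_probes(model, records):
    # Forward hooks recording (layer name, alpha, 1/std of the input) so that
    # alpha can be plotted against 1/sigma over the course of training.
    def make_hook(name, layer):
        def hook(module, inputs, output):
            x = inputs[0].detach().float()
            records.append((name, layer.alpha.item(), (1.0 / x.std()).item()))
        return hook
    for name, m in model.named_modules():
        if isinstance(m, DyT):
            m.register_forward_hook(make_hook(name, m))

Call attach_alpha_probes(model, records) once before training, passing an empty list, and inspect the recorded triples afterwards.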

Computation efficiency

Benchmarked on an NVIDIA H100 (BF16) with 4096‑token sequences:

DyT forward pass is ~30 % faster than RMSNorm.

Full training step (forward + backward) shows a similar speedup; the trend holds for FP32.

DyT therefore removes the per‑token mean/variance computation entirely, which is where the latency gain comes from; a rough timing sketch follows.
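A rough, self‑contained timing sketch (not the paper's benchmark harness; it compares DyT against nn.LayerNorm on whatever CUDA GPU is available and assumes the DyT class above):

import time

def bench(layer, x, iters=200, warmup=20):
    # Average forward latency of a single layer, in milliseconds.
    with torch.no_grad():
        for _ in range(warmup):
            layer(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

dim = 4096
x = torch.randn(8, 4096, dim, device="cuda", dtype=torch.bfloat16)
print("DyT      :", bench(DyT(dim).to(device="cuda", dtype=torch.bfloat16), x))
print("LayerNorm:", bench(nn.LayerNorm(dim).to(device="cuda", dtype=torch.bfloat16), x))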

Additional ablations

Learning‑rate tuning and α‑initialization were examined on all non‑LLM tasks:

Using the original learning rate yields almost identical performance; aggressive tuning offers only marginal gains (<0.2 %).

Default α₀ = 0.5 is near‑optimal for most models; only ViT‑Large shows instability when α₀ > 0.6, which can be mitigated by lowering the learning rate.

For LLMs, a systematic sweep over α₀ on a 30B‑token pre‑training run reveals:

Optimal α₀ decreases with model size (7B → 0.45, 70B → 0.30).

Higher α₀ in attention blocks and lower α₀ elsewhere yields the lowest loss (see the sketch after this list).

Model width, not depth, drives the optimal α₀ (wider models need smaller values).
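A minimal sketch of such a per‑block initialization (the α₀ values here are illustrative placeholders, not the paper's tuned settings):

dim = 4096
attn_norm = DyT(dim, alpha_init_value=0.8)   # DyT feeding the attention block
ffn_norm = DyT(dim, alpha_init_value=0.2)    # DyT feeding the feed-forward block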

Conclusions

Dynamic Tanh provides a 9‑line, statistics‑free replacement for LayerNorm/RMSNorm that preserves or improves accuracy across vision, language, speech, and genomics while offering measurable speedups on modern GPUs. The learned scaling factor α implicitly performs a normalization‑like adjustment, explaining why the method can match traditional norm layers without explicit mean/variance computation.

Project code and pretrained checkpoints are available at https://github.com/jiachenzhu/DyT.

Tags: deep learning, Transformer, AI Research, Model Efficiency, Normalization, Dynamic Tanh