LONGNET: Extending Transformers to Over 1 Billion Tokens

LONGNET introduces dilated attention to enable Transformers to process sequences exceeding one billion tokens with linear computational cost, preserving performance on shorter inputs and demonstrating strong results on long‑sequence modeling and standard language tasks.


Microsoft’s recent research presents LONGNET, a Transformer variant that scales sequence length beyond 1 billion tokens without sacrificing performance on shorter sequences. The core innovation is dilated attention, in which attention allocation decreases exponentially as the distance between tokens grows, achieving linear computational complexity while keeping a logarithmic dependency between any two tokens.
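To make the mechanism concrete, below is a minimal PyTorch sketch of a single (segment length, dilation) branch of dilated attention. The function name and shapes are our own illustration, and it omits causal masking as well as the paper's mixing of multiple branches:

```python
import torch

def dilated_attention(q, k, v, segment_len, dilation):
    """One (segment_len, dilation) branch of dilated attention (a sketch).

    q, k, v: (batch, seq_len, dim); seq_len must be a multiple of segment_len.
    Each segment is subsampled every `dilation` steps, attention is computed
    densely on the shortened segment, and outputs are scattered back.
    Causal masking and the paper's multi-branch mixing are omitted here.
    """
    b, n, d = q.shape
    # 1. Split the sequence into non-overlapping segments.
    q = q.view(b, n // segment_len, segment_len, d)
    k = k.view(b, n // segment_len, segment_len, d)
    v = v.view(b, n // segment_len, segment_len, d)
    # 2. Sparsify each segment by keeping every `dilation`-th token.
    idx = torch.arange(0, segment_len, dilation)
    qs, ks, vs = q[:, :, idx], k[:, :, idx], v[:, :, idx]
    # 3. Dense attention over the gathered (much shorter) segments.
    attn = torch.einsum('bsid,bsjd->bsij', qs, ks) / d ** 0.5
    out_sparse = torch.einsum('bsij,bsjd->bsid', attn.softmax(-1), vs)
    # 4. Scatter the outputs back to their original positions (zeros elsewhere).
    out = torch.zeros_like(q)
    out[:, :, idx] = out_sparse
    return out.view(b, n, d)
```

In the full method, several such branches with geometrically growing segment lengths and dilation rates run in parallel and their outputs are mixed, so nearby tokens are attended densely while distant tokens are sampled ever more sparsely.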

The paper lists three main advantages: (1) linear computational cost, (2) the ability to serve as a distributed trainer for extremely long sequences, and (3) a drop-in replacement for standard attention that integrates seamlessly with existing Transformer optimizations.

Motivation stems from the trend of scaling neural networks’ capacity; unlimited sequence length would grant models a larger memory and receptive field, support richer causal reasoning, and help mitigate catastrophic forgetting. However, extending length faces a trade‑off between computational complexity and expressive power.

Compared to prior approaches—RNN‑style models, state‑space models, low‑rank or kernel‑based attention, down‑sampling, and retrieval methods—none have reached the 1 billion token scale. LONGNET’s dilated attention offers a scalable alternative.

Implementation-wise, LONGNET can be transformed into a dense Transformer, which lets it reuse existing optimizations such as kernel fusion, quantization, and distributed training. Its linear complexity also enables parallelism across nodes, breaking single-device memory and compute constraints. As a result, runtime stays nearly constant as sequence length grows, unlike the quadratic growth of a vanilla Transformer (see the paper’s runtime comparison figure).
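A back-of-envelope calculation shows why the cost is linear: each branch runs dense attention over N/w segments of w/r tokens, so its cost is proportional to N·w/r². The sketch below uses an illustrative (w, r) schedule, not the paper's exact configuration:

```python
def vanilla_attention_flops(n, d):
    # Scores (n * n * d) plus the weighted sum over values (n * n * d).
    return 2 * n * n * d

def dilated_attention_flops(n, d, segment_lens, dilations):
    # Each branch: n // w segments, dense attention over w // r tokens each.
    return sum(2 * (n // w) * (w // r) ** 2 * d
               for w, r in zip(segment_lens, dilations))

# Illustrative geometric schedule (our assumption, not the paper's setting).
ws, rs = [2048, 4096, 8192, 16384], [1, 2, 4, 8]
for n in (32_768, 262_144, 1_048_576):
    ratio = vanilla_attention_flops(n, 512) / dilated_attention_flops(n, 512, ws, rs)
    print(f"n={n:>9}: vanilla/dilated FLOP ratio ~ {ratio:,.0f}x")
```

Because every branch's cost grows proportionally to N, the total stays linear in sequence length, while vanilla attention grows quadratically; the gap therefore widens as sequences get longer.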

The study also introduces multi‑head dilated attention, which sparsifies the query‑key‑value pairs differently across heads by shifting each head’s attended positions (illustrated in the paper’s multi‑head attention figure).
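A small sketch of the per-head shift: in the paper, the j-th head offsets the dilated sampling grid, so heads collectively cover different slices of each segment. The helper below is our own illustration of that idea:

```python
def head_positions(segment_len, dilation, num_heads):
    # Head j shifts the dilated sampling grid by j mod dilation, so the
    # heads jointly cover different parts of each segment.
    return [list(range(j % dilation, segment_len, dilation))
            for j in range(num_heads)]

for j, pos in enumerate(head_positions(segment_len=8, dilation=4, num_heads=4)):
    print(f"head {j}: {pos}")
# head 0: [0, 4]   head 1: [1, 5]   head 2: [2, 6]   head 3: [3, 7]
```

With as many heads as the dilation rate, every position in the segment is covered by some head, even though each individual head attends sparsely.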

Experiments compare LONGNET with vanilla and sparse Transformers on language modeling. The models share identical non‑attention layers; only the attention mechanism differs. Sequence length is increased from 2K to 32K while the total number of tokens per batch is kept constant. Results on The Stack dataset (see the paper’s results table) show that LONGNET consistently outperforms both baselines in computational efficiency and perplexity. The study also removes absolute positional encodings, observes that training on longer sequences improves model quality, and notes that inference‑time length extrapolation fails when the sequence length far exceeds what the model was trained on.
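Holding tokens per batch constant while growing the sequence length means the batch size shrinks in proportion, which keeps the comparison compute-matched. Illustrative numbers only (the per-batch token budget is our assumption, not the paper's exact setting):

```python
tokens_per_batch = 524_288  # assumed fixed budget, for illustration only
for seq_len in (2_048, 8_192, 16_384, 32_768):
    print(f"seq_len={seq_len:>6}  batch_size={tokens_per_batch // seq_len}")
```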

Overall, LONGNET demonstrates that dilated attention can extend Transformers to unprecedented sequence lengths with linear cost, offering a practical path for modeling extremely long contexts such as whole corpora or the internet.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Transformer · Language Modeling · Long Sequence Modeling · Dilated Attention · Linear Complexity · LONGNET
Written by

Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains: intelligent cloud networking, natural language processing, computer vision, and machine learning systems. The center is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
