Network Intelligence Research Center (NIRC)
Aug 22, 2023 · Artificial Intelligence
LONGNET: Extending Transformers to Over 1 Billion Tokens
LONGNET introduces dilated attention to enable Transformers to process sequences exceeding one billion tokens with linear computational cost, preserving performance on shorter inputs and demonstrating strong results on long‑sequence modeling and standard language tasks.
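The core idea can be illustrated with a minimal sketch: split the sequence into fixed-length segments, keep only every r-th token inside each segment, and run standard attention on that sparsified subset. The function below is an assumed simplification (single head, single segment-length/dilation pair, no mixing across dilation rates as in the full LONGNET design); the names `dilated_attention`, `segment_len`, and `dilation` are illustrative, not from the paper's code.

```python
import numpy as np

def dilated_attention(q, k, v, segment_len, dilation):
    """Sketch of dilated attention for one (segment_len, dilation) pair.

    Each segment attends only within itself, over every `dilation`-th
    token, so per-segment cost is O((segment_len/dilation)^2 * d) and
    total cost grows linearly in sequence length for fixed segment_len.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, segment_len):
        # Indices of the sparsified segment: every `dilation`-th token.
        idx = np.arange(start, min(start + segment_len, n))[::dilation]
        qs, ks, vs = q[idx], k[idx], v[idx]
        # Standard scaled dot-product attention on the subset.
        scores = qs @ ks.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ vs
    return out
```

In the full method, several such sparsified patterns with different segment lengths and dilation rates are computed in parallel and their outputs combined, so every position is eventually covered while the overall cost stays linear.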
Dilated Attention · LONGNET · Language Modeling
6 min read
