SpikingBrain 2.0 Breaks Long‑Sequence and Low‑Power Bottlenecks in Brain‑Inspired LLMs
The Chinese Academy of Sciences has unveiled SpikingBrain 2.0‑5B, a brain‑inspired large model that uses dual‑space sparse attention and two activation encoding paths (FP8 and INT8‑Spiking) to cut training cost more than tenfold, achieve up to a 15× speedup on long sequences, and match Qwen‑3 performance while drastically reducing power consumption.
Background
Large‑model research is shifting from scaling parameter counts to scaling context length. Applications such as agents, code understanding and long‑document analysis require handling tens of thousands to millions of tokens. Traditional Transformers incur high compute and energy costs at both ends: feed‑forward matrix multiplication dominates short‑sequence workloads, while attention, whose cost grows quadratically with sequence length, becomes the bottleneck for long sequences.
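The short‑versus‑long trade‑off above can be made concrete with a rough FLOP count for one standard Transformer layer. This is a back‑of‑the‑envelope sketch, not a measurement of any model in this article: the function name `layer_flops`, the hidden size of 4096, and the 4× feed‑forward expansion are illustrative assumptions, and causal masking and multi‑head details are ignored.

```python
def layer_flops(seq_len: int, d_model: int, ffn_mult: int = 4) -> dict:
    """Rough forward-pass FLOPs for one dense Transformer layer.

    Counts one multiply-add as 2 FLOPs. All sizes are hypothetical.
    """
    # Q, K, V and output projections: four (d_model x d_model) matmuls per token.
    proj = 2 * 4 * seq_len * d_model * d_model
    # Attention scores (Q K^T) plus the weighted sum over values:
    # two matmuls that each cost seq_len^2 * d_model multiply-adds.
    attn = 2 * 2 * seq_len * seq_len * d_model
    # Feed-forward block: two matmuls between d_model and ffn_mult * d_model.
    ffn = 2 * 2 * seq_len * d_model * (ffn_mult * d_model)
    return {"proj": proj, "attn": attn, "ffn": ffn}


if __name__ == "__main__":
    d_model = 4096  # assumed hidden size, for illustration only
    for n in (1_024, 16_384, 1_000_000):
        f = layer_flops(n, d_model)
        print(f"seq_len={n:>9}: attn/ffn = {f['attn'] / f['ffn']:.2f}")
```

Under these assumptions the attention term overtakes the feed‑forward term once the sequence length exceeds `ffn_mult * d_model` tokens, which is why sparse attention matters most at long context.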
SpikingBrain 2.0‑5B Architecture
Dual‑Space Sparse Attention (DSSA) mixes block‑sparse Softmax attention (MoBA) on the full key‑value cache with Sparse State Expansion (SSE) on compressed state representations, emulating sparse memory mechanisms observed in the brain.
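To illustrate the block‑sparse half of this design, here is a minimal single‑query sketch of MoBA‑style block selection: the key‑value cache is split into blocks, each block is scored by the query's dot product with its mean key, and Softmax attention runs only over the top‑scoring blocks. The function name, arguments, and toy sizes are hypothetical; causal masking, multi‑head handling, and the SSE branch are omitted, so this should not be read as SpikingBrain's actual implementation.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring key blocks.

    Returns (output_vector, selected_key_indices).
    """
    # Score every contiguous block by the query's similarity to its mean key.
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((_dot(q, mean_key), start))

    # Keep the top_k blocks and gather the token indices they cover.
    chosen = sorted(start for _, start in sorted(scored, reverse=True)[:top_k])
    idx = [i for s in chosen for i in range(s, min(s + block_size, len(keys)))]

    # Ordinary scaled-dot-product Softmax, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [_dot(q, keys[i]) * scale for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(dim)]
    return out, idx
```

With `top_k` fixed, the per‑query cost scales with the number of blocks plus `top_k * block_size` selected tokens rather than with the full sequence length.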
Dual activation‑value encoding paths:
FP8 path leverages low‑bit Tensor‑Core acceleration on industrial GPUs (e.g., NVIDIA Hopper) for dense matrix multiplication.
INT8‑Spiking path converts activations into spike sequences, allowing event‑driven integer accumulation on asynchronous neuromorphic chips.
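The idea behind the INT8‑Spiking path can be sketched as follows: activations are rounded to signed integer spike counts, and the matrix‑vector product is computed by accumulating weight rows only for inputs that actually fired. The encoding used here (a symmetric scale and one signed count per activation) is an assumption for illustration; the article does not specify SpikingBrain's exact spike coding, and all function names are hypothetical.

```python
def quantize_to_spike_counts(acts, scale):
    """Round activations to signed integer spike counts in [-127, 127]."""
    return [max(-127, min(127, round(a / scale))) for a in acts]


def event_driven_matvec(spike_counts, weights):
    """Integer accumulation that skips silent (zero-count) inputs.

    weights[i] is the integer weight row for input i. Multiplying a row by
    its count stands in for accumulating that row once per emitted spike.
    Returns (output_vector, number_of_active_inputs).
    """
    out = [0] * len(weights[0])
    events = 0
    for i, count in enumerate(spike_counts):
        if count == 0:
            continue  # no spike, no work: this is where sparsity saves energy
        events += 1
        for j, w in enumerate(weights[i]):
            out[j] += count * w
    return out, events


def spike_sparsity(spike_counts):
    """Fraction of inputs that emit no spikes at all."""
    return sum(1 for c in spike_counts if c == 0) / len(spike_counts)


if __name__ == "__main__":
    counts = quantize_to_spike_counts([0.02, -0.51, 0.0, 1.2, 0.04], scale=0.1)
    weights = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    print(counts, event_driven_matvec(counts, weights), spike_sparsity(counts))
```

On event‑driven hardware the skipped inputs cost nothing, so the fraction of zero counts translates directly into avoided accumulations.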
Training Pipeline
The Transformer‑to‑Hybrid conversion pipeline reduces the continuation‑training data from 150 B tokens (SpB 1.0) to 14 B tokens. Only 32 A100 GPUs are needed for nine days of continual pre‑training, achieving a total conversion cost below 7 k A100‑GPU‑hours.
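The cost figures above can be checked with simple arithmetic. The snippet assumes all 32 GPUs run around the clock for the full nine days, which is an upper bound rather than a reported utilization figure.

```python
# Continual pre-training budget quoted in the article.
gpus = 32
days = 9
gpu_hours = gpus * days * 24  # assumes 24 h/day on every GPU

# Continuation-training data: SpB 1.0 versus SpikingBrain 2.0.
tokens_v1 = 150e9
tokens_v2 = 14e9
data_ratio = tokens_v1 / tokens_v2

print(f"A100 GPU-hours: {gpu_hours}")
print(f"Data reduction: {data_ratio:.1f}x")
```

The product comes out just under the quoted 7 k A100‑GPU‑hour ceiling, and the token ratio is slightly above tenfold.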
LLM conversion includes short‑context distillation, a three‑stage long‑context extension up to 512 k tokens, and a two‑stage SFT with policy distillation. VLM conversion adds knowledge distillation and instruction fine‑tuning.
Performance Evaluation
Long‑sequence efficiency
With Hugging Face sequence‑parallel inference, time to first token (TTFT) at 4 M tokens is 10.13× faster than Qwen‑3.
FP8 quantization at the same length yields a 15.13× TTFT speedup over Qwen‑3 BF16 with only 0.24% accuracy loss.
In vLLM tensor‑parallel tests, latency at 512 k tokens is 4.3× lower, throughput at 128 k tokens is 1.57× higher, and request concurrency is 3.17× higher.
Eight A100 cards can run inference on sequences of up to 10 M tokens, whereas Qwen‑3 runs out of memory at 4 M tokens.
Training cost
Data volume reduced from 150 B to 14 B tokens (≈10× lower).
Training completed with 32 A100 GPUs in nine days, cutting overall cost by more than tenfold compared with SpB 1.0.
Benchmark results
Matches Qwen‑3 on MMLU, ARC‑C, BBH, GSM8K, MATH, HumanEval and MBPP.
Outperforms Qwen‑2.5 and the larger SpB 1.0‑7B on the same tasks.
Hardware adaptation
FP8 path on H100: 256 k token TTFT 2.5× faster than BF16; 4 M token TTFT 15.13× faster than Qwen‑3 BF16.
INT8‑Spiking path: accuracy loss 0.69%; spike sparsity 64.3%.
Neuromorphic simulation shows a 70.6% area reduction, along with lower power at both 250 MHz and 500 MHz, compared with an INT8 matrix‑multiply baseline.
Resources
Paper: https://arxiv.org/abs/2604.22575
Code: https://github.com/BICLab/SpikingBrain2.0
Source: ScienceAI
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.