SpikingBrain 2.0 Breaks Long‑Sequence and Low‑Power Bottlenecks in Brain‑Inspired LLMs

The Chinese Academy of Sciences has unveiled SpikingBrain 2.0‑5B, a brain‑inspired large model that combines dual‑space sparse attention with two activation encoding paths (FP8 and INT8‑Spiking). It cuts training cost more than tenfold relative to its predecessor, reaches up to a 15× speedup on long sequences, and matches Qwen‑3 on standard benchmarks while sharply reducing power consumption.

Data Party THU

May 10, 2026

Background

Large‑model research is shifting from scaling parameters to scaling context length. Applications such as agents, code understanding, and long‑document analysis must handle tens of thousands to millions of tokens. Traditional Transformers incur high compute and energy costs at both ends: feed‑forward matrix multiplication dominates short‑sequence workloads, while attention becomes the bottleneck for long sequences.
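The short‑versus‑long trade‑off described above can be illustrated with a back‑of‑the‑envelope FLOP count. This is a rough sketch, not a measurement: the function name `layer_flops` and the layer sizes are hypothetical, the feed‑forward block is assumed to use two matrices, and causal masking, normalization and softmax costs are ignored.

```python
def layer_flops(n_tokens: int, d_model: int, d_ff: int) -> dict:
    """Rough forward FLOPs for one dense Transformer layer (1 multiply-add = 2 FLOPs)."""
    # Q, K, V and output projections: four d_model x d_model matmuls per token.
    proj = 2 * n_tokens * 4 * d_model * d_model
    # Two-matrix feed-forward block: d_model -> d_ff -> d_model (an assumption).
    ffn = 2 * n_tokens * 2 * d_model * d_ff
    # Attention scores (QK^T) plus weighted sum (AV): quadratic in sequence length.
    attn = 2 * 2 * n_tokens * n_tokens * d_model
    return {"proj": proj, "ffn": ffn, "attn": attn}


# Hypothetical layer sizes chosen for illustration; not SpikingBrain's configuration.
D_MODEL, D_FF = 4096, 16384

for n in (2_048, 32_768, 1_048_576):
    flops = layer_flops(n, D_MODEL, D_FF)
    share = flops["attn"] / sum(flops.values())
    print(f"{n:>9} tokens: attention share of layer FLOPs = {share:.1%}")
```

Under these assumptions the feed‑forward term is the largest at a few thousand tokens, and the quadratic attention term takes over as the sequence grows, which is the regime sparse attention targets.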

SpikingBrain 2.0‑5B Architecture

Dual‑Space Sparse Attention (DSSA) mixes block‑sparse Softmax attention (MoBA) on the full key‑value cache with Sparse State Expansion (SSE) on compressed state representations, emulating sparse memory mechanisms observed in the brain.
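To make the block‑sparse half of this idea concrete, here is a minimal NumPy sketch of block‑sparse causal attention in the spirit of MoBA. It is an illustration under assumptions, not the SpikingBrain implementation: the function name `block_sparse_attention`, the mean‑pooled block summaries, and the `block_size` and `top_k` values are choices made for this example, and the SSE branch on compressed states is not shown.

```python
import numpy as np


def block_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Single-head causal attention where each query attends to a few key blocks."""
    n, d = q.shape
    n_blocks = (n + block_size - 1) // block_size
    # Summarize each key block by the mean of its keys (assumed routing rule).
    centroids = np.stack([
        k[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    out = np.zeros_like(v)
    for i in range(n):
        cur = i // block_size
        # Score only blocks strictly before the current one.
        scores = centroids[:cur] @ q[i]
        best = np.argsort(scores)[::-1][:max(top_k - 1, 0)].tolist()
        # The query's own block is always kept.
        picked = set(best) | {cur}
        # Gather key positions from the picked blocks, respecting causality.
        idx = np.array([
            j
            for b in sorted(picked)
            for j in range(b * block_size, min((b + 1) * block_size, n))
            if j <= i
        ])
        logits = (k[idx] @ q[i]) / np.sqrt(d)
        weights = np.exp(logits - logits.max())
        out[i] = (weights / weights.sum()) @ v[idx]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
sparse_out = block_sparse_attention(q, k, v, block_size=4, top_k=2)
print(sparse_out.shape)
```

Each query touches at most `top_k * block_size` keys, so the cost per query stays bounded as the sequence grows. Setting `top_k` to the number of blocks recovers dense causal attention, which is a convenient correctness check.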

Dual activation‑value encoding paths:

FP8 path leverages low‑bit Tensor‑Core acceleration on industrial GPUs (e.g., NVIDIA Hopper) for dense matrix multiplication.

INT8‑Spiking path converts activations into spike sequences, allowing event‑driven integer accumulation on asynchronous neuromorphic chips.
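The INT8‑Spiking idea can be sketched in a few lines of NumPy. The article does not specify the spike encoding, so this example assumes a simple count code: an integer activation x becomes |x| signed unit spikes spread over time steps. The helper names (`quantize_int8`, `to_spike_train`, `event_driven_matvec`) and the quantization scale are hypothetical.

```python
import numpy as np


def quantize_int8(x, scale):
    """Symmetric quantization of float activations to int8."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def to_spike_train(x_int, n_steps=127):
    """Expand each integer into |x| signed unit spikes (assumed count coding)."""
    mag = np.abs(x_int.astype(np.int32))
    steps = np.arange(n_steps)[:, None]        # shape (T, 1)
    fires = steps < mag[None, :]               # shape (T, N), True where a spike occurs
    return fires * np.sign(x_int).astype(np.int32)


def event_driven_matvec(spikes, w_int):
    """Accumulate weight rows only where a spike fires; no multiplications."""
    acc = np.zeros(w_int.shape[1], dtype=np.int32)
    for t, i in zip(*np.nonzero(spikes)):
        if spikes[t, i] > 0:
            acc += w_int[i]
        else:
            acc -= w_int[i]
    return acc


rng = np.random.default_rng(0)
x = rng.standard_normal(32).astype(np.float32)
w_int = rng.integers(-127, 128, size=(32, 8)).astype(np.int8)

x_int = quantize_int8(x, scale=0.05)           # scale chosen arbitrarily for the demo
spikes = to_spike_train(x_int)
acc = event_driven_matvec(spikes, w_int)
dense = x_int.astype(np.int32) @ w_int.astype(np.int32)

print("matches dense integer matmul:", np.array_equal(acc, dense))
print("fraction of silent slots:", 1 - np.count_nonzero(spikes) / spikes.size)
```

Because every spike triggers only an integer add or subtract, work scales with the number of spikes rather than the size of the matrix, which is why higher spike sparsity translates into less work on event‑driven hardware.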

Training Pipeline

The Transformer‑to‑Hybrid conversion pipeline cuts the continual pre‑training data from 150 B tokens for SpikingBrain 1.0 (SpB 1.0) to 14 B tokens. Continual pre‑training takes nine days on 32 A100 GPUs, which puts the total conversion cost below 7 k A100‑GPU‑hours.
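The figures in this paragraph can be checked with simple arithmetic. The snippet assumes the GPUs run around the clock for the full nine days; the variable names are for illustration only.

```python
gpus = 32
days = 9
gpu_hours = gpus * days * 24           # assumes 24 h/day utilization
print(gpu_hours)                       # 6912, i.e. below 7 k A100-GPU-hours

tokens_v1 = 150e9                      # SpB 1.0 continual pre-training data
tokens_v2 = 14e9                       # SpikingBrain 2.0 continual pre-training data
data_ratio = tokens_v1 / tokens_v2
print(f"data reduction: {data_ratio:.1f}x")
```

The product of 32 GPUs, nine days and 24 hours is 6,912 GPU‑hours, consistent with the stated bound, and the data volume shrinks by a little more than a factor of ten.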

LLM conversion consists of short‑context distillation, a three‑stage long‑context extension up to 512 k tokens, and two‑stage supervised fine‑tuning (SFT) with policy distillation. VLM conversion adds knowledge distillation and instruction fine‑tuning.

Performance Evaluation

Long‑sequence efficiency

With Hugging Face sequence parallelism, time to first token (TTFT) at 4 M tokens is 10.13× faster than Qwen‑3.

FP8 quantization at the same length yields a 15.13× speedup over Qwen‑3 in BF16, with only a 0.24% accuracy loss.

In vLLM tensor‑parallel tests, latency at 512 k tokens drops by 4.3×, throughput at 128 k tokens rises by 1.57×, and request concurrency improves by 3.17×.

Eight A100 GPUs can run inference on sequences of up to 10 M tokens, whereas Qwen‑3 runs out of memory at 4 M tokens.

Training cost

Data volume reduced from 150 B to 14 B tokens (≈10× lower).

Training completed with 32 A100 GPUs in nine days, cutting overall cost by more than tenfold compared with SpB 1.0.

Benchmark results

Matches Qwen‑3 on MMLU, ARC‑C, BBH, GSM8K, MATH, HumanEval and MBPP.

Outperforms Qwen‑2.5 and the larger SpB 1.0‑7B on the same tasks.

Hardware adaptation

FP8 path on H100: 256 k token TTFT 2.5× faster than BF16; 4 M token TTFT 15.13× faster than Qwen‑3 BF16.

INT8‑Spiking path: accuracy loss 0.69%; spike sparsity 64.3%.

Neuromorphic simulation shows a 70.6% area reduction and power reduction at 250 MHz / 500 MHz compared with an INT8 matrix‑multiply baseline.

Resources

Paper: https://arxiv.org/abs/2604.22575

Code: https://github.com/BICLab/SpikingBrain2.0

Source: ScienceAI

The work validates the broad promise of combining brain‑inspired mechanisms with efficient model architectures.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
