Qwen3-Next: Achieving Unmatched Training and Inference Cost‑Effectiveness

Alibaba's Qwen team unveils Qwen3-Next, a mixture-of-experts LLM with 80 B total parameters but only 3 B active, delivering training costs under one‑tenth of the comparable dense Qwen3-32B and more than ten‑fold inference throughput for long contexts, while matching or surpassing larger models on benchmark tasks.


Introduction

Alibaba's Qwen team introduced Qwen3-Next, a mixture-of-experts (MoE) model designed to improve training and inference efficiency at long context lengths and very large total parameter counts.

Model Variants and Cost Efficiency

The base model Qwen3-Next-80B-A3B-Base contains 80 billion total parameters but activates only 3 billion per token. It matches or slightly exceeds the performance of the dense Qwen3-32B while requiring less than one‑tenth of its GPU hours.

For inference at context lengths beyond 32 K, its throughput is more than ten times that of Qwen3-32B.

Performance Highlights

Pre‑training efficiency: Qwen3-Next consumes only 9.3 % of the compute of Qwen3-32B and less than 80 % of the GPU hours of Qwen3-30B-A3B.

Inference speed: 7‑10× higher pre‑fill throughput and 4‑10× higher decode throughput compared with Qwen3-32B, despite activating only 1/10 of the parameters.

Instruction model: Qwen3-Next-80B-A3B-Instruct outperforms models with larger parameter counts and approaches the flagship Qwen3-235B-A22B-Instruct-2507, especially on long‑context tasks such as RULER at 256 K.

Thinking model: Qwen3-Next-80B-A3B-Thinking outperforms costlier models, beats the closed‑source Gemini-2.5-Flash-Thinking, and approaches the performance of Qwen3-235B-A22B-Thinking-2507.

Innovation – Mixed Architecture

Linear attention avoids the quadratic complexity of standard attention but, on its own, suffers from weak recall, while standard attention is slow at long context lengths. The team therefore combined Gated DeltaNet with standard attention in a 3:1 layer ratio (75 % Gated DeltaNet, 25 % standard attention), which consistently beat either mechanism alone in both performance and efficiency.
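As a rough sketch of the 3:1 interleaving (the 75/25 split is from the announcement; the layer count and the repeating three-linear-then-one-full ordering are illustrative assumptions, not the confirmed layout):

# Sketch: a hypothetical 48-layer stack mixing Gated DeltaNet and standard attention 3:1.
NUM_LAYERS = 48  # illustrative depth, not the confirmed configuration

def layer_type(i: int) -> str:
    # Every 4th layer uses standard (full) attention; the other three use Gated DeltaNet.
    return "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"

layout = [layer_type(i) for i in range(NUM_LAYERS)]
print(layout.count("gated_deltanet") / NUM_LAYERS)  # 0.75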

Enhancements to the standard attention layers include the following (a partial-RoPE sketch follows this list):

Adopting an output gating mechanism from prior work to alleviate low‑rank issues in attention.

Increasing the per‑head dimension from 128 to 256.

Applying rotary positional encoding to only the first 25 % of the position dimensions, improving extrapolation on long sequences.
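A minimal sketch of the partial-RoPE idea, assuming a standard rotate-half implementation (function and variable names are illustrative, not the model's actual code):

import torch

def apply_partial_rope(x, rotary_fraction=0.25, base=10000.0):
    # x: (batch, heads, seq, head_dim). Only the first `rotary_fraction` of each
    # head's dimensions is rotated; the rest pass through position-free, which is
    # the property credited with better extrapolation on long sequences.
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_fraction)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # Standard RoPE frequencies, computed over the rotary slice only.
    seq_len = x.shape[-2]
    inv_freq = 1.0 / base ** (torch.arange(0, rot_dim, 2).float() / rot_dim)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, rot_dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)          # (seq, rot_dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)

    # Rotate-half formulation on the rotary slice.
    x1, x2 = x_rot[..., :rot_dim // 2], x_rot[..., rot_dim // 2:]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

# With head_dim = 256 (as in Qwen3-Next), only the first 64 dimensions are rotated.
q = torch.randn(1, 2, 8, 256)
print(apply_partial_rope(q).shape)  # torch.Size([1, 2, 8, 256])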

Innovation – Sparse MoE

Qwen3-Next uses a highly sparse MoE design: 80 B total parameters with only ~3 B activated per token. Experiments show that, under a global load‑balancing scheme, increasing the total number of expert parameters while holding the number of activated experts fixed steadily reduces training loss.

Compared with the previous Qwen3 MoE (128 total experts, 8 routed), Qwen3-Next expands to 512 total experts with a "10 routed experts + 1 shared expert" configuration, maximizing resource utilization without hurting performance. The shared expert handles general patterns while the routed experts specialize, providing both efficiency and robustness.
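A minimal sketch of this routing scheme (shared expert always on, top-10 of 512 routed experts per token; the dimensions, module names, and naive per-token dispatch are illustrative assumptions):

import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    # Illustrative: 512 routed experts plus 1 always-active shared expert, top-10 routing.
    def __init__(self, d_model=1024, d_ff=512, n_experts=512, top_k=10):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared_expert = make_expert()
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)          # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the top-k
        routed = torch.zeros_like(x)
        # Naive per-token dispatch for clarity; real kernels batch tokens by expert.
        for t in range(x.shape[0]):
            for k in range(self.top_k):
                routed[t] = routed[t] + weights[t, k] * self.experts[idx[t, k]](x[t])
        return self.shared_expert(x) + routed  # shared expert sees every token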

Innovation – Multi‑Token Prediction (MTP)

Qwen3-Next introduces native multi‑token prediction, yielding a speculative‑decoding module with high acceptance rates and improving overall model performance. The MTP module is further tuned for multi‑step inference by keeping training and inference objectives consistent, which raises speculative‑decoding acceptance in practice.
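The acceptance logic follows the generic draft-and-verify pattern of speculative decoding; below is a toy sketch under greedy decoding (not Qwen's confirmed implementation; all names are illustrative):

import torch

def accept_draft(draft_tokens, verify_logits):
    # draft_tokens:  (n_draft,) token ids proposed by the MTP head in one shot.
    # verify_logits: (n_draft, vocab) logits from a single verification pass of the
    #                main model over the drafted positions.
    # Returns the longest accepted prefix, with the main model's correction on mismatch.
    accepted = []
    for token, logits in zip(draft_tokens, verify_logits):
        best = logits.argmax().item()
        accepted.append(best)
        if best != token.item():  # greedy acceptance test: stop at the first disagreement
            break
    return accepted

# Toy usage: a 4-token draft where the verifier disagrees at the 3rd position,
# so 2 drafted tokens are accepted and the 3rd is replaced.
vocab_size = 10
draft = torch.tensor([1, 2, 3, 4])
logits = torch.full((4, vocab_size), -1e9)
for pos, tok in enumerate([1, 2, 7, 4]):  # verifier's greedy choice per position
    logits[pos, tok] = 0.0
print(accept_draft(draft, logits))  # [1, 2, 7]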

The design suggests that techniques such as chain‑of‑thought, speculative decoding, role‑playing, and memory mechanisms could become native capabilities of future models.

Usage Example

The model code has been merged into the main branch of Hugging Face Transformers. Install from source, then run the sample inference script:

pip install git+https://github.com/huggingface/transformers.git@main

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# Load the tokenizer and the model; device_map="auto" places shards across available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)

# Build a chat-formatted prompt.
prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens.
generated_ids = model.generate(**model_inputs, max_new_tokens=16384)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)

Conclusion

Qwen3-Next delivers major architectural advances, integrating linear attention, gating mechanisms, and a higher‑sparsity MoE. Its 80 B‑parameter variant approaches the performance of the much larger flagship Qwen3-235B-A22B-2507 while offering substantially higher inference speed, especially in long‑context scenarios.

Tags: AI, LLM, benchmark, Multi‑Token Prediction, sparse MoE, mixed attention, Qwen3-Next
Written by AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
