Qwen3-Next Unveiled: Sparse MoE, Hybrid Attention & Multi‑Token Prediction
A recent Hugging Face pull request reveals Alibaba’s upcoming Qwen3‑Next series, highlighting its extreme‑context, parameter‑efficient design that combines a 1:50 high‑sparsity MoE, a hybrid attention architecture mixing gated attention with Gated DeltaNet, and a Multi‑Token Prediction technique, promising ten‑fold throughput gains for 32K‑plus token contexts.
Background
The Hugging Face transformers repository contains a pull request (PR #40771, https://github.com/huggingface/transformers/pull/40771/files) that adds support for a new member of Alibaba's Qwen model family, provisionally called Qwen3-Next. The model has not been released yet, but the submitted documentation and code reveal several architectural innovations aimed at extreme context length handling and parameter‑efficient scaling.
Core Highlights of Qwen3-Next
According to the model documentation (docs/source/en/model_doc/qwen3_next.md), the Qwen3-Next series is positioned as a next‑generation foundation model optimized for extreme context length and large‑scale parameter efficiency. The flagship variant, Qwen3-Next-80B-A3B, has 80 billion total parameters but activates only about 3 billion per token, delivering performance that exceeds a dense 32‑billion‑parameter model while providing roughly ten‑fold higher throughput for contexts longer than 32K tokens.
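The parameter‑efficiency claim is already encoded in the model name itself (80B total, A3B for ~3B active). A trivial back‑of‑the‑envelope check, with both figures read off the name rather than measured from a checkpoint:

```python
# Sparsity implied by the name "Qwen3-Next-80B-A3B":
# 80B total parameters, ~3B activated per token (figures from the name, not measured).
total_params, active_params = 80e9, 3e9
print(f"active fraction per token: {active_params / total_params:.2%}")  # 3.75%
```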
Architecture Innovation 1: High‑Sparsity MoE (1:50 Activation)
The model uses a highly sparse Mixture‑of‑Experts (MoE) layer with an activation ratio of 1:50. The default MoE configuration is defined in src/transformers/models/qwen3_next/configuration_qwen3_next.py:

```python
# in src/transformers/models/qwen3_next/configuration_qwen3_next.py
class Qwen3NextConfig(PretrainedConfig):
    def __init__(
        self,
        # ...
        num_experts_per_tok=10,
        num_experts=512,
        # ...
    ):
        ...
```

With 512 total experts (num_experts) and 10 experts selected per token (num_experts_per_tok), the effective activation ratio is 10/512 ≈ 1:50. This "multiple‑choice" routing provides richer feature combinations and smoother expert selection than single‑expert routing, contributing to the model's strong performance.
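The routing itself is standard top‑k gating. Here is a minimal, self‑contained sketch of how a 10‑of‑512 router selects experts; the toy shapes and the exact softmax placement are illustrative assumptions, not the PR's router implementation:

```python
import torch
import torch.nn.functional as F

# Toy top-k MoE routing with the PR's default sizes (illustrative only).
num_experts = 512          # total experts in the MoE layer
num_experts_per_tok = 10   # experts activated per token

hidden = torch.randn(4, 2048)                # (tokens, hidden_size) -- toy shapes
router = torch.nn.Linear(2048, num_experts)  # produces one logit per expert

logits = router(hidden)                                  # (tokens, num_experts)
gates, chosen = torch.topk(logits, num_experts_per_tok)  # pick 10 of 512 experts
gates = F.softmax(gates, dim=-1)                         # normalize over chosen experts

print(chosen.shape)  # torch.Size([4, 10]) -- expert indices per token
print(f"activation ratio: 10/512 = {num_experts_per_tok / num_experts:.4f}")  # ~1:51
```

Each token's output would then be the gate‑weighted sum of the outputs of its 10 selected experts.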
Architecture Innovation 2: Hybrid Attention Mechanism
To process ultra‑long contexts efficiently, Qwen3‑Next replaces standard self‑attention with a hybrid attention system that alternates between Gated Attention (full attention) and Gated DeltaNet (a state‑space model‑based linear attention). The layer pattern is encoded in the same configuration file:
```python
# in src/transformers/models/qwen3_next/configuration_qwen3_next.py
class Qwen3NextConfig(PretrainedConfig):
    def __init__(
        self,
        layer_types=None,
        # ...
    ):
        self.layer_types = layer_types
        if self.layer_types is None:
            self.layer_types = [
                "linear_attention" if bool((i + 1) % 4) else "full_attention"
                for i in range(self.num_hidden_layers)
            ]
```

This creates a repeating pattern in which, for every four transformer layers, three use linear_attention (mapped to Gated DeltaNet) and one uses full_attention (Gated Attention). A unit test (tests/models/qwen3_next/test_modeling_qwen3_next.py) verifies that attention outputs are produced only by the full_attention layers:
```python
# in tests/models/qwen3_next/test_modeling_qwen3_next.py
def test_attention_outputs(self):
    """Ensures that only full_attention layers emit attention scores."""
    with torch.no_grad():
        outputs = model(**self._prepare_for_class(inputs_dict, model_class))
    attentions = outputs.attentions
    self.assertEqual(
        len(attentions),
        sum(layer == "full_attention" for layer in config.layer_types),
    )
```

This hybrid approach lets the model capture critical information with full attention while scaling efficiently with linear attention in the majority of layers, striking a balance between accuracy and speed.
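As a sanity check, the default pattern can be reproduced standalone from the list comprehension in the config snippet above; only the toy layer count below is an assumption:

```python
# Reproduce the default layer_types pattern for a toy 8-layer model.
num_hidden_layers = 8  # illustration only; the real model has more layers
layer_types = [
    "linear_attention" if bool((i + 1) % 4) else "full_attention"
    for i in range(num_hidden_layers)
]
print(layer_types)
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```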
Architecture Innovation 3: Multi‑Token Prediction (MTP)
Qwen3‑Next also incorporates Multi‑Token Prediction (MTP), allowing the model to predict several future tokens in parallel during pre‑training. Compared with the traditional next‑token paradigm, MTP improves training throughput and helps the model develop better long‑range planning capabilities, which is considered a key technique for next‑generation large language models.
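The PR excerpt above does not spell out the MTP head layout, but the general idea is easy to sketch: a shared trunk feeds several heads, each trained to predict the token d positions ahead. Everything below (the head count, the toy shapes, and the embedding standing in for the transformer body) is an illustrative assumption rather than Qwen3‑Next's actual implementation:

```python
import torch
import torch.nn as nn

# Generic Multi-Token Prediction sketch: one shared trunk, n_future heads,
# head d trained to predict the token d positions ahead. Illustrative only.
vocab_size, hidden_size, n_future = 1000, 64, 3

trunk = nn.Embedding(vocab_size, hidden_size)  # stand-in for the transformer body
heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(n_future))

tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len) toy batch
h = trunk(tokens)                               # (batch, seq_len, hidden)

loss = 0.0
for d, head in enumerate(heads, start=1):
    logits = head(h[:, :-d])   # predictions for the token d steps ahead
    targets = tokens[:, d:]    # labels shifted by d
    loss = loss + nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
print(loss)  # summed cross-entropy over all prediction horizons
```

Beyond the training benefit, the extra heads can also serve as cheap draft predictors for speculative decoding at inference time, which is one reason MTP pairs well with throughput‑oriented designs.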
Conclusion
The pull request introduces a substantial architectural overhaul to the Qwen series: a 1:50 high‑sparsity MoE, a hybrid attention scheme that mixes gated full attention with state‑space‑based linear attention, and a Multi‑Token Prediction training strategy. Together these innovations aim to set new standards for long‑context handling and computational efficiency in large language models.