What Makes Qwen3 the Next Leap in Large Language Models?

This article introduces Qwen3, covering its flagship 235B‑parameter and smaller MoE models, benchmark performance, extensive multilingual support, expanded pretraining data, four‑stage post‑training, flexible thinking modes, deployment guides for SGLang, vLLM, and Ollama, and future plans toward AGI‑level capabilities.


Introduction

Qwen3 is the latest member of the Qwen series of large language models. The flagship model Qwen3‑235B‑A22B achieves competitive results on code, mathematics, and general benchmarks compared with top models such as DeepSeek‑R1, o1, o3‑mini, Grok‑3, and Gemini‑2.5‑Pro. A smaller MoE model, Qwen3‑30B‑A3B, uses only 10% of the activated parameters of QwQ‑32B while delivering stronger performance, and even the 4B variant rivals Qwen2.5‑72B‑Instruct.

Two MoE models (Qwen3‑235B‑A22B and Qwen3‑30B‑A3B) and six dense models (Qwen3‑32B, 14B, 8B, 4B, 1.7B, 0.6B) have been open‑sourced under the Apache 2.0 license.

Core Highlights

Multiple Thinking Modes

Qwen3 supports a thinking mode that performs step‑by‑step reasoning before delivering a final answer, and a non‑thinking mode that provides fast, near‑instant responses. Users can dynamically switch between modes to balance inference cost and answer quality, enabling fine‑grained control of a “thinking budget”.

Multilingual Capability

The model understands 119 languages and dialects across Indo‑European, Sino‑Tibetan, Afro‑Asiatic, Austronesian, Dravidian, Turkic, and other families, making it suitable for global applications.

Pretraining

Qwen3’s pretraining dataset is roughly double that of Qwen2.5, expanding from 18 trillion to about 36 trillion tokens and covering 119 languages and dialects. Data were collected from the web and PDF‑like documents (text extracted with Qwen2.5‑VL), supplemented with synthetic math and code data generated by the expert models Qwen2.5‑Math and Qwen2.5‑Coder.

The pretraining pipeline consists of three stages:

S1 – Over 30 trillion tokens with a 4K context length, establishing basic language skills.

S2 – An additional 5 trillion tokens emphasizing STEM, programming, and reasoning data.

S3 – Long‑context training up to 32K tokens using high‑quality extended texts.

Post‑training

A four‑stage fine‑tuning process was applied to develop hybrid models that combine deep reasoning with rapid response:

Stage 1 – Long chain‑of‑thought (CoT) cold start.

Stage 2 – Reinforcement learning over long chain‑of‑thought reasoning.

Stage 3 – Fusion of thinking and non‑thinking modes.

Stage 4 – General reinforcement learning across 20+ tasks (instruction following, format compliance, agent abilities, etc.).

This yields models that excel in STEM, coding, and reasoning while activating only about 10% of the parameters of comparable dense models, reducing both training and inference costs.

Getting Started with Qwen3

Below is a minimal example for loading Qwen3‑30B‑A3B with Hugging Face Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-30B-A3B"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
# Prepare input with optional thinking mode
prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Set to False to disable thinking mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# Token 151668 marks the end of the thinking block (</think>); search from
# the end so parsing still works when the model emits no thinking content.
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)

To disable thinking mode, set enable_thinking=False in the apply_chat_template call.
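
For example, reusing the messages and tokenizer from above, only that one argument changes:

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # returns a direct answer with no <think> block
)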

Deployment Options

For serving, use SGLang (≥ 0.4.6.post1) or vLLM (≥ 0.8.4) to create an OpenAI‑compatible endpoint:

python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3
vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1

Remove the --reasoning-parser flag (and --enable-reasoning) to turn off the thinking mode.
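
Once the server is up, any OpenAI‑compatible client can talk to it. Below is a minimal sketch, assuming a local endpoint with vLLM's default port 8000 (SGLang defaults to 30000) and no API key:

from openai import OpenAI

# Hypothetical local endpoint; adjust base_url to match your server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Explain mixture-of-experts briefly."}],
)
print(response.choices[0].message.content)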

For local development, run ollama run qwen3:30b-a3b or use LMStudio, llama.cpp, or KTransformers.

Advanced Usage – Soft Switching

Users can embed /think or /no_think tags in prompts or system messages to toggle the thinking mode per turn.

A multi‑turn conversation can toggle modes dynamically and call tools via Qwen‑Agent; a minimal sketch of the soft switch follows below.
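
As a rough sketch (reusing the tokenizer and model loaded in the Getting Started section), a per‑turn soft switch looks like this; the tag in the latest user message overrides the enable_thinking default:

# Turn 1: suppress the reasoning trace for a quick factual answer.
messages = [{"role": "user", "content": "How many r's are in 'strawberries'? /no_think"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
reply = tokenizer.decode(out[0][len(inputs.input_ids[0]):], skip_special_tokens=True)

# Turn 2: re-enable step-by-step reasoning with /think.
messages += [
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "Explain how you counted. /think"},
]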

Future Development

Qwen3 marks a milestone toward AGI and ASI, with plans to scale data, model size, context length, modality breadth, and reinforcement learning with environmental feedback. The roadmap emphasizes a shift from pure model training to agent‑centric training.
