Exploring Qwen3: Open‑Source LLM Features, Benchmarks, and Deployment Guides
This article introduces the Qwen3 family of open‑source large language models, details their architecture, parameter counts, multilingual support, and benchmark performance, and provides step‑by‑step instructions for deploying them with frameworks like SGLang, vLLM, and local runtimes such as Ollama and LMStudio.
Introduction
Qwen3 is the latest open‑source series of large language models released by Alibaba. The flagship model Qwen3‑235B‑A22B achieves competitive results on coding, mathematics, and general‑purpose benchmarks compared with top models such as DeepSeek‑R1, o1, Grok‑3 and Gemini‑2.5‑Pro. A smaller Mixture‑of‑Experts (MoE) model Qwen3‑30B‑A3B uses only 10% of the activation parameters of a 32B model while outperforming it, and even the 4B variant rivals the performance of Qwen2.5‑72B‑Instruct.
Model Overview
The released models include:
Qwen3‑235B‑A22B: 235 billion total parameters, 22 billion activated parameters (MoE).
Qwen3‑30B‑A3B: ~30 billion total parameters, 3 billion activated parameters (MoE).
Six dense models: Qwen3‑32B, Qwen3‑14B, Qwen3‑8B, Qwen3‑4B, Qwen3‑1.7B, Qwen3‑0.6B, all released under the Apache 2.0 license.
Dense models:

Model        Layers   Heads (Q/KV)   Tie Embedding   Context Length
Qwen3‑0.6B   28       16/8           Yes             32K
Qwen3‑1.7B   28       16/8           Yes             32K
Qwen3‑4B     36       32/8           Yes             32K
Qwen3‑8B     36       32/8           No              128K
Qwen3‑14B    40       40/8           No              128K
Qwen3‑32B    64       64/8           No              128K

MoE models:

Model             Layers   Heads (Q/KV)   Experts (Total/Activated)   Context Length
Qwen3‑30B‑A3B     48       32/4           128/8                       128K
Qwen3‑235B‑A22B   94       64/4           128/8                       128K

All models are available on Hugging Face, ModelScope, and Kaggle for immediate use.
Deployment Recommendations
For serving the models, SGLang and vLLM are the recommended frameworks; both provide OpenAI‑compatible endpoints and support the model's reasoning mode.
Local Usage
Local runtimes such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers can run Qwen3 models for research, development, or production workloads.
Code Example – Transformers
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switch between thinking and non-thinking modes
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate text
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# split thinking and final content at the last </think> token (id 151668)
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> token found; treat everything as final content

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

To disable the reasoning mode, set enable_thinking=False in apply_chat_template.
Serving with SGLang
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3

Serving with vLLM
vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1

Removing the --reasoning-parser flag (and --enable-reasoning) disables the thinking mode.
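Once either server is up, any OpenAI‑compatible client can talk to it. Below is a minimal sketch using the official openai Python package; the port is an assumption (vLLM defaults to 8000, SGLang to 30000), and the api_key is a placeholder, since local servers do not check it unless configured to:

from openai import OpenAI

# Point the client at the local OpenAI-compatible server started above
# (port 8000 assumes the vLLM default; adjust to your launch configuration).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)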
Local Development Commands
ollama run qwen3:30b-a3b

Similar commands work with LMStudio, llama.cpp, or KTransformers.
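Ollama also exposes an OpenAI‑compatible endpoint (http://localhost:11434/v1 by default), so the same client pattern shown in the serving section works against a local model; a minimal sketch, assuming the model was pulled with the command above:

from openai import OpenAI

# Ollama does not check the API key, but the client requires some value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen3:30b-a3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)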
Advanced Usage – Dynamic Thinking Switch
When enable_thinking=True, the model can be toggled per turn by placing the directives /think and /no_think in user or system messages; the most recent directive controls the model's behavior, as sketched below.
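A minimal sketch of this soft switch, reusing the tokenizer and model loaded in the Transformers example above (the question is illustrative):

messages = [
    {"role": "user", "content": "How many r's are in 'strawberries'? /no_think"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # the soft switch only takes effect when this is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
print(tokenizer.decode(generated_ids[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True))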
Agent Example with Qwen‑Agent
Qwen3’s tool‑calling capabilities are exposed through the Qwen‑Agent library. The following snippet shows how to configure the LLM, define tools (time, fetch, code interpreter), and run a multi‑turn conversation that fetches a blog URL.
from qwen_agent.agents import Assistant

# Configure the LLM behind an OpenAI-compatible endpoint
llm_cfg = {
    'model': 'Qwen3-30B-A3B',
    'model_server': 'http://localhost:8000/v1',  # OpenAI-compatible endpoint
    'api_key': 'EMPTY'
}

# Define tools (MCP servers for time and fetch, plus the built-in code interpreter)
tools = [
    {'mcpServers': {
        'time': {'command': 'uvx', 'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']},
        'fetch': {'command': 'uvx', 'args': ['mcp-server-fetch']}
    }},
    'code_interpreter'
]

bot = Assistant(llm=llm_cfg, function_list=tools)

# Run a multi-turn conversation; bot.run streams intermediate responses,
# so the loop leaves `responses` holding the final, complete turn.
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

The assistant first calls the fetch tool to retrieve the blog content, then processes the result and generates a structured summary of the latest Qwen releases.
Future Directions
Qwen3 is positioned as a milestone toward artificial general intelligence (AGI) and artificial superintelligence (ASI). Future work will expand data scale, model size, context length, and modality coverage, and shift focus from pure model training to training agents that can reason over long horizons using reinforcement learning.
Conclusion
Qwen3’s open‑source release provides a diverse set of dense and MoE models, multilingual capabilities (119 languages), flexible reasoning modes, and extensive tooling for deployment and agent development, empowering researchers and developers to build innovative AI solutions.
JavaEdge
Hands‑on development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative investing.