How Qwen1.5‑MoE‑A2.7B Matches 7B LLM Performance with Only 2.7B Activated Parameters
Qwen1.5‑MoE‑A2.7B is a Mixture‑of‑Experts model with 2.7 billion activated parameters that delivers performance comparable to leading 7 billion‑parameter LLMs while cutting training cost by 75% and boosting inference speed by 1.74×. This article details its architecture, benchmarks, efficiency analysis, and deployment steps.
Introduction
The Qwen team released their first MoE model, Qwen1.5‑MoE‑A2.7B, which has 2.7 billion activated parameters but achieves performance on par with state‑of‑the‑art 7 billion‑parameter models such as Mistral‑7B and Qwen1.5‑7B. Compared with the dense 7 billion‑parameter baseline, it reduces training cost by 75% and improves inference speed by 1.74×.
Model Architecture
The model adopts a specially designed MoE architecture. Conventional MoE models such as Mixtral use eight experts per MoE layer with a top‑2 gating strategy. Qwen1.5‑MoE introduces three key improvements:
Fine‑grained experts: each FFN is split into multiple independent experts, enabling 64 experts without increasing total parameter count.
Improved initialization: the model is initialized from the existing Qwen‑1.8B checkpoint, adding randomness to accelerate convergence and improve overall performance.
New routing mechanism: the model contains four always‑active shared experts and 60 routing experts, of which only four are activated per token, providing flexibility and efficiency.
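The shared‑plus‑routed gating described above can be sketched in a few lines. This is a minimal NumPy illustration, not the trained Qwen router: the gate weights are random stand‑ins, and the function name and dimensions are hypothetical.

```python
import numpy as np

def moe_route(token_hidden, n_routing=60, n_shared=4, top_k=4, seed=0):
    """Sketch of shared + top-k expert routing for a single token.

    Returns the indices of all experts that fire: the top_k routing
    experts by gate score, followed by the always-active shared experts.
    """
    rng = np.random.default_rng(seed)
    # Router: a linear map from the hidden state to one logit per routing expert.
    gate_w = rng.standard_normal((token_hidden.shape[0], n_routing))
    logits = token_hidden @ gate_w
    # Softmax over the 60 routing experts, then keep only the top_k.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_idx = np.argsort(probs)[-top_k:]
    # Shared experts (indexed after the routing experts here) always fire.
    shared_idx = np.arange(n_routing, n_routing + n_shared)
    return np.concatenate([top_idx, shared_idx]), probs[top_idx]

active, gate_scores = moe_route(np.ones(8))
# Per token: 4 routed + 4 shared experts active, out of 64 total.
```

Only 8 of 64 experts run per token, which is where the gap between total parameters (14.3B) and activated parameters (2.7B) comes from.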
Performance Evaluation
Both the base and chat variants were evaluated on a range of benchmarks. The base model was tested on MMLU, GSM8K, HumanEval, and multilingual tasks; the chat model was evaluated with MT‑Bench. Results (higher is better) are summarized below:
| Model | MMLU | GSM8K | HumanEval | Multilingual | MT‑Bench |
|---|---|---|---|---|---|
| Mistral‑7B | 64.1 | 47.5 | 27.4 | 40.0 | 7.60 |
| Gemma‑7B | 64.6 | 50.9 | 32.3 | - | - |
| Qwen1.5‑7B | 61.0 | 62.5 | 36.0 | 45.2 | 7.60 |
| DeepSeekMoE‑16B | 45.0 | 18.8 | 26.8 | - | 6.93 |
| Qwen1.5‑MoE‑A2.7B | 62.5 | 61.5 | 34.2 | 40.8 | 7.17 |

The MoE model's scores are very close to those of the best 7B models, indicating competitive language understanding, reasoning, and code generation capabilities.
Training Cost and Inference Efficiency
Key parameter statistics illustrate the efficiency gains:
| Model | #Parameters | #Activated Params | #Activated Non‑Embedding Params |
|---|---|---|---|
| Mistral‑7B | 7.2B | 7.2B | 7.0B |
| Gemma‑7B | 8.5B | 7.8B | 7.8B |
| Qwen1.5‑7B | 7.7B | 7.7B | 6.4B |
| DeepSeekMoE‑16B | 16.4B | 2.8B | 2.4B |
| Qwen1.5‑MoE‑A2.7B | 14.3B | 2.7B | 2.0B |

Despite a larger total parameter count, the model activates far fewer non‑embedding parameters than dense 7B models, which yields the 75% reduction in training cost. Inference benchmarks on a single NVIDIA A100‑80G GPU show:
| Model | Throughput (req/s) | TPS |
|---|---|---|
| Qwen1.5‑7B‑Chat | 1.15 | 2298.89 |
| Qwen1.5‑MoE‑A2.7B‑Chat | 2.01 | 4010.27 |

The MoE model processes roughly 1.74× more requests per second, thanks to sparse activation and shared‑expert optimizations.
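The headline ratios can be sanity‑checked directly from the table values above:

```python
# Figures from the tables above (single A100-80G).
dense_req, moe_req = 1.15, 2.01        # throughput, requests/s
dense_tps, moe_tps = 2298.89, 4010.27  # tokens/s

speedup_req = moe_req / dense_req      # ~1.75x more requests served
speedup_tps = moe_tps / dense_tps      # ~1.74x more tokens generated

# Activated non-embedding parameters: 2.0B (MoE) vs 6.4B (Qwen1.5-7B dense).
activation_ratio = 2.0 / 6.4           # ~31% of the dense model's active compute
```

Both throughput ratios land at about 1.74x, matching the reported speedup, while the MoE model activates under a third of the dense model's non‑embedding parameters per token.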
Deployment and Usage
Because the latest Hugging Face release does not yet include qwen2_moe, users must install the transformers library from source:
```shell
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e .
```

Example code to load the quantized chat model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

For high‑throughput serving, the model can be deployed with vLLM:
```shell
git clone https://github.com/wenyujin333/vllm.git
cd vllm
git checkout add_qwen_moe
pip install -e .
```

Running the OpenAI‑compatible API server:
```shell
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-MoE-A2.7B-Chat
```

Querying it with curl:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me something about large language models."}
    ]
  }'
```

Future work includes adding support for third‑party runtimes such as llama.cpp and MLX.
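The same request can also be sent from Python using only the standard library; the endpoint and model name mirror the curl call above, assuming the vLLM server is running locally:

```python
import json
from urllib import request  # stdlib HTTP client, no extra dependencies

payload = {
    "model": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server above is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the server speaks the OpenAI chat‑completions protocol, any OpenAI‑compatible client library can be pointed at it the same way.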
Conclusion
The Qwen1.5‑MoE‑A2.7B model demonstrates that a carefully designed MoE architecture can achieve performance comparable to much larger dense models while dramatically reducing training expenses and inference latency. Ongoing research aims to further improve MoE fine‑tuning and expand ecosystem support.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
