How Qwen1.5‑MoE‑A2.7B Matches 7B LLM Performance with Only 2.7B Activated Parameters
Qwen1.5‑MoE‑A2.7B is a Mixture‑of‑Experts model with 2.7 billion activated parameters that delivers performance comparable to leading 7 billion‑parameter LLMs while cutting training cost by 75% and boosting inference speed by 1.74×. This article details its architecture, benchmarks, efficiency analysis, and deployment steps.
Introduction
The Qwen team released their first MoE model, Qwen1.5‑MoE‑A2.7B, which has 2.7 billion activated parameters but achieves performance on par with state‑of‑the‑art 7 billion‑parameter models such as Mistral‑7B and Qwen1.5‑7B. Compared with the dense 7 billion‑parameter baseline, it reduces training cost by 75% and improves inference speed by 1.74×.
Model Architecture
The model adopts a specially designed MoE architecture. Conventional MoE models such as Mixtral use eight experts per MoE layer with a top‑2 gating strategy. Qwen1.5‑MoE introduces three key improvements:
Fine‑grained experts: each FFN is split into multiple independent experts, enabling 64 experts without increasing total parameter count.
Improved initialization: the model is initialized from the existing Qwen‑1.8B checkpoint, adding randomness to accelerate convergence and improve overall performance.
New routing mechanism: the model contains four always‑active shared experts and 60 routing experts, of which only four are activated per token, providing flexibility and efficiency.
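The shared‑plus‑routed gating described above can be sketched in a few lines. This is a minimal NumPy illustration, not the trained Qwen router: the gate weights are random stand‑ins, and the function name and dimensions are hypothetical.

```python
import numpy as np

def moe_route(token_hidden, n_routing=60, n_shared=4, top_k=4, seed=0):
    """Sketch of shared + top-k expert routing for a single token.

    Returns the indices of all experts that fire: the top_k routing
    experts by gate score, followed by the always-active shared experts.
    """
    rng = np.random.default_rng(seed)
    # Router: a linear map from the hidden state to one logit per routing expert.
    gate_w = rng.standard_normal((token_hidden.shape[0], n_routing))
    logits = token_hidden @ gate_w
    # Softmax over the 60 routing experts, then keep only the top_k.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_idx = np.argsort(probs)[-top_k:]
    # Shared experts (indexed after the routing experts here) always fire.
    shared_idx = np.arange(n_routing, n_routing + n_shared)
    return np.concatenate([top_idx, shared_idx]), probs[top_idx]

active, gate_scores = moe_route(np.ones(8))
# Per token: 4 routed + 4 shared experts active, out of 64 total.
```

Only 8 of 64 experts run per token, which is where the gap between total parameters (14.3B) and activated parameters (2.7B) comes from.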
Performance Evaluation
Both the base and chat variants were evaluated on a range of benchmarks. The base model was tested on MMLU, GSM8K, HumanEval, and multilingual tasks; the chat model was evaluated with MT‑Bench. Results (higher is better) are summarized below:
| Model | MMLU | GSM8K | HumanEval | Multilingual | MT‑Bench |
|---|---|---|---|---|---|
| Mistral‑7B | 64.1 | 47.5 | 27.4 | 40.0 | 7.60 |
| Gemma‑7B | 64.6 | 50.9 | 32.3 | - | - |
| Qwen1.5‑7B | 61.0 | 62.5 | 36.0 | 45.2 | 7.60 |
| DeepSeekMoE‑16B | 45.0 | 18.8 | 26.8 | - | 6.93 |
| Qwen1.5‑MoE‑A2.7B | 62.5 | 61.5 | 34.2 | 40.8 | 7.17 |

The MoE model's scores are very close to those of the best 7B models, indicating competitive language understanding, reasoning, and code generation capabilities.
Training Cost and Inference Efficiency
Key parameter statistics illustrate the efficiency gains:
| Model | #Parameters | #Activated Params | #Activated Non‑Embedding Params |
|---|---|---|---|
| Mistral‑7B | 7.2B | 7.2B | 7.0B |
| Gemma‑7B | 8.5B | 7.8B | 7.8B |
| Qwen1.5‑7B | 7.7B | 7.7B | 6.4B |
| DeepSeekMoE‑16B | 16.4B | 2.8B | 2.4B |
| Qwen1.5‑MoE‑A2.7B | 14.3B | 2.7B | 2.0B |

Despite a larger total parameter count, the model activates far fewer non‑embedding parameters than dense 7B models, which yields the 75% reduction in training cost. Inference benchmarks on a single NVIDIA A100‑80G GPU show:
| Model | Throughput (req/s) | TPS |
|---|---|---|
| Qwen1.5‑7B‑Chat | 1.15 | 2298.89 |
| Qwen1.5‑MoE‑A2.7B‑Chat | 2.01 | 4010.27 |

The MoE model processes roughly 1.74× more requests per second, thanks to sparse activation and shared‑expert optimizations.
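The headline ratios can be sanity‑checked directly from the table values above:

```python
# Figures from the tables above (single A100-80G).
dense_req, moe_req = 1.15, 2.01        # throughput, requests/s
dense_tps, moe_tps = 2298.89, 4010.27  # tokens/s

speedup_req = moe_req / dense_req      # ~1.75x more requests served
speedup_tps = moe_tps / dense_tps      # ~1.74x more tokens generated

# Activated non-embedding parameters: 2.0B (MoE) vs 6.4B (Qwen1.5-7B dense).
activation_ratio = 2.0 / 6.4           # ~31% of the dense model's active compute
```

Both throughput ratios land at about 1.74x, matching the reported speedup, while the MoE model activates under a third of the dense model's non‑embedding parameters per token.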
Deployment and Usage
Because the latest Hugging Face release does not yet include qwen2_moe, users must install the transformers library from source:
```shell
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e .
```

Example code to load the quantized chat model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

For high‑throughput serving, the model can be deployed with vLLM:
```shell
git clone https://github.com/wenyujin333/vllm.git
cd vllm
git checkout add_qwen_moe
pip install -e .
```

Running the OpenAI‑compatible API server:
```shell
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-MoE-A2.7B-Chat
```

Querying it with curl:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me something about large language models."}
    ]
  }'
```

Future work includes adding support for third‑party runtimes such as llama.cpp and MLX.
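The same request can also be sent from Python using only the standard library; the endpoint and model name mirror the curl call above, assuming the vLLM server is running locally:

```python
import json
from urllib import request  # stdlib HTTP client, no extra dependencies

payload = {
    "model": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server above is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the server speaks the OpenAI chat‑completions protocol, any OpenAI‑compatible client library can be pointed at it the same way.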
Conclusion
The Qwen1.5‑MoE‑A2.7B model demonstrates that a carefully designed MoE architecture can achieve performance comparable to much larger dense models while dramatically reducing training expenses and inference latency. Ongoing research aims to further improve MoE fine‑tuning and expand ecosystem support.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
