Qwen1.5-110B vs Llama‑3‑70B: Performance Insights of Alibaba’s 110B Model

Alibaba unveiled the 110‑billion‑parameter Qwen1.5‑110B model, featuring GQA, 32k context and multilingual support, and benchmark results show it matches or surpasses Llama‑3‑70B and Mixtral‑8x22B on a range of tasks, with notable gains in chat evaluations.

Baobao Algorithm Notes

Apr 27, 2024

On April 26, 2024, Alibaba released Qwen1.5‑110B, which at 110 billion parameters is the largest Chinese open‑source language model to date. The model adopts Grouped Query Attention (GQA) for more efficient inference, supports a 32K‑token context window, and is multilingual, covering English, Chinese, French, Spanish, German, Russian, Japanese, Korean, Vietnamese, Arabic and other languages.
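To make the inference benefit of GQA concrete, here is a rough back-of-the-envelope sketch of key/value cache size at a 32K context. The layer count, head counts and head dimension below are illustrative assumptions, not figures from the article; only the 32K context length comes from the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen for illustration only (not from the article).
NUM_LAYERS = 80
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8          # GQA: several query heads share one K/V head
HEAD_DIM = 128
SEQ_LEN = 32 * 1024       # the 32K context window mentioned above
GIB = 1024 ** 3

# Standard multi-head attention keeps one K/V head per query head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA keeps only NUM_KV_HEADS K/V heads.
gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / GIB:.1f} GiB")
print(f"GQA cache: {gqa / GIB:.1f} GiB")
print(f"Reduction: {mha // gqa}x")
```

Under these assumed shapes and 16-bit values, the cache per sequence shrinks by the ratio of query heads to K/V heads, which is where the serving efficiency comes from.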

Qwen1.5‑110B uses the same Transformer decoder architecture as the rest of the Qwen1.5 series. Its notable architectural traits are GQA and the 32K context length, and the pre‑training recipe is unchanged from the 72B version.
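For readers unfamiliar with GQA, the following is a minimal pure-Python sketch of the idea: several query heads share one key/value head inside an otherwise ordinary causal attention layer. It is a teaching sketch under simplified assumptions (no batching, no projections, no positional encoding), not Qwen's implementation, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_head, k_head, v_head):
    """Causal scaled dot-product attention for one head.

    Each argument is a list of vectors, one per position.
    """
    d = len(q_head[0])
    out = []
    for t, q_vec in enumerate(q_head):
        # Causal mask: position t only sees positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q_vec, k_head[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[s] * v_head[s][j] for s in range(t + 1))
            for j in range(len(v_head[0]))
        ])
    return out


def grouped_query_attention(q, k, v):
    """q has n_q heads; k and v have n_kv heads, with n_q divisible by n_kv."""
    n_q, n_kv = len(q), len(k)
    if n_q % n_kv != 0:
        raise ValueError("number of query heads must be a multiple of K/V heads")
    group = n_q // n_kv
    # Query head h reads the K/V head shared by its group.
    return [attend(q[h], k[h // group], v[h // group]) for h in range(n_q)]
```

With as many K/V heads as query heads this reduces to standard multi-head attention, and with a single K/V head it becomes multi-query attention; GQA sits between the two.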

Zero‑shot benchmark results (higher is better). Each line gives the Qwen1.5‑110B score first, followed in parentheses by Qwen1.5‑72B, Llama‑3‑70B and Mixtral‑8x22B, in that order:

MMLU: 80.4 (vs 77.5, 79.5, 77.8)

TheoremQA: 34.9 (vs 29.3, 32.0, 35.9)

GPQA: 35.9 (vs 36.3, 36.4, 34.3)

Hellaswag: 87.5 (vs 86.0, 88.0, 88.7)

BBH: 74.8 (vs 65.5, 76.6, 69.2)

ARC‑C: 69.6 (vs 65.9, 68.8, 70.7)

GSM8K: 85.4 (vs 79.5, 79.2, 78.6)

MATH: 49.6 (vs 34.1, 41.0, 41.7)

HumanEval: 52.4 (vs 41.5, 45.7, 45.1)

MBPP: 58.1 (vs 53.4, 55.1, 71.2)

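The results show that the 110B model is competitive with Llama‑3‑70B on these benchmarks: it leads on seven of the ten (MMLU, TheoremQA, ARC‑C, GSM8K, MATH, HumanEval and MBPP) and trails on GPQA, Hellaswag and BBH. It also beats the 72B Qwen1.5 baseline on every benchmark except GPQA.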
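The head-to-head picture can be checked with a short tally over the scores listed above. This is only a convenience sketch; the scores are copied from the list, while the names `SCORES`, `BASELINES` and `tally` are made up for the example.

```python
# Scores copied from the benchmark list above, in the order:
# (Qwen1.5-110B, Qwen1.5-72B, Llama-3-70B, Mixtral-8x22B)
SCORES = {
    "MMLU":      (80.4, 77.5, 79.5, 77.8),
    "TheoremQA": (34.9, 29.3, 32.0, 35.9),
    "GPQA":      (35.9, 36.3, 36.4, 34.3),
    "Hellaswag": (87.5, 86.0, 88.0, 88.7),
    "BBH":       (74.8, 65.5, 76.6, 69.2),
    "ARC-C":     (69.6, 65.9, 68.8, 70.7),
    "GSM8K":     (85.4, 79.5, 79.2, 78.6),
    "MATH":      (49.6, 34.1, 41.0, 41.7),
    "HumanEval": (52.4, 41.5, 45.7, 45.1),
    "MBPP":      (58.1, 53.4, 55.1, 71.2),
}
BASELINES = ["Qwen1.5-72B", "Llama-3-70B", "Mixtral-8x22B"]


def tally(scores):
    """For each baseline, list benchmarks where the 110B model wins and loses."""
    result = {}
    for col, name in enumerate(BASELINES, start=1):
        wins = [bench for bench, row in scores.items() if row[0] > row[col]]
        losses = [bench for bench, row in scores.items() if row[0] < row[col]]
        result[name] = (wins, losses)
    return result


for name, (wins, losses) in tally(SCORES).items():
    print(f"vs {name}: {len(wins)} wins, {len(losses)} losses ({', '.join(losses)})")
```

Against Mixtral‑8x22B the split is closer (six wins to four), with MBPP the largest gap in Mixtral's favor.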

Chat‑oriented evaluation:

MT‑Bench average score: 8.88 (Llama‑3‑70B‑Instruct 8.85, Qwen1.5‑72B‑Chat 8.61)

AlpacaEval 2.0 win rate: 43.90 % (Llama‑3‑70B‑Instruct 34.40 %, Qwen1.5‑72B‑Chat 36.60 %)

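These figures show a clear advantage for the 110B model on AlpacaEval 2.0 (9.5 points over Llama‑3‑70B‑Instruct and 7.3 over Qwen1.5‑72B‑Chat), while its MT‑Bench lead over Llama‑3‑70B‑Instruct is marginal (0.03). The gain over the 72B chat model on both benchmarks suggests that scaling model size alone can improve chat performance without major changes to the training pipeline.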
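The size of each margin is easy to recompute from the two chat scores. The snippet below is a small sketch using only the numbers quoted above; the names `CHAT` and `margins` are made up for the example.

```python
# Chat scores as quoted above.
CHAT = {
    "Qwen1.5-110B-Chat":    {"MT-Bench": 8.88, "AlpacaEval 2.0": 43.90},
    "Llama-3-70B-Instruct": {"MT-Bench": 8.85, "AlpacaEval 2.0": 34.40},
    "Qwen1.5-72B-Chat":     {"MT-Bench": 8.61, "AlpacaEval 2.0": 36.60},
}


def margins(results, target="Qwen1.5-110B-Chat"):
    """Difference between the target model and every other model, per metric."""
    out = {}
    for model, row in results.items():
        if model == target:
            continue
        out[model] = {
            metric: round(results[target][metric] - score, 2)
            for metric, score in row.items()
        }
    return out


for model, diff in margins(CHAT).items():
    print(model, diff)
```

Note that MT‑Bench is scored out of 10 and AlpacaEval 2.0 is a win rate in percent, so the two margins are not directly comparable in magnitude.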

In conclusion, the performance uplift of Qwen1.5‑110B is primarily attributed to its increased scale. The authors anticipate that future releases will combine larger datasets with even bigger models to capture the benefits of both dimensions.

Blog: https://qwenlm.github.io/blog/qwen1.5-110b
HF: https://huggingface.co/Qwen/Qwen1.5-110B-Chat
Demo: https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo
