How a Champion Quantized a 150 GB MoE Model in Just 4 Hours
Algorithm engineer Zhang Zhen, from a Chinese EV company, quantized the massive Qwen3‑Next‑80B model within a four‑hour competition window. Here he details the end‑to‑end workflow that won the GeekDay challenge: sensitive‑layer analysis, iterative smoothing, fallback strategies, and parallel "horse‑race" debugging.
Challenge Overview
On March 28, the Magic Community and Huawei Ascend hosted an extreme large‑model quantization challenge. Algorithm engineer Zhang Zhen from a Chinese new‑energy vehicle company completed the full quantization pipeline for the Qwen3‑Next‑80B model (≈150 GB, 47 layers, MoE architecture) and passed accuracy verification within four hours, winning the competition.
Model and Environment Preparation
Because downloading the 150 GB model alone takes substantial time, environment setup and model download were completed before the competition. Zhang also studied the model's MoE structure, full‑attention layers, and linear‑attention layers in advance to anticipate quantization difficulties.
Algorithm Selection
After testing several methods, Zhang chose Iterative Smooth because it balances speed and accuracy and can transfer activation scale values across adjacent layers in MoE structures. SmoothQuant performed poorly in the W4A8 scenario, while Flex Smooth Quant achieved higher accuracy at the cost of extensive hyper‑parameter search.
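The smoothing idea shared by SmoothQuant and Iterative Smooth is to migrate activation outliers into the weights through a per‑channel scale, leaving the matmul output mathematically unchanged while making activations easier to quantize. A minimal NumPy sketch of that transform (not msModelSlim's actual API; the function and parameter names are illustrative):

```python
import numpy as np

def smooth_weights(x_absmax, weight, alpha=0.5):
    """Compute a per-input-channel smoothing scale
        s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
    and fold it into the weight. At runtime the activation is divided by
    the same scale, so (X / s) @ (W * s).T == X @ W.T exactly.
    """
    w_absmax = np.abs(weight).max(axis=0)                 # per input channel
    scale = x_absmax ** alpha / np.clip(w_absmax, 1e-8, None) ** (1 - alpha)
    scale = np.clip(scale, 1e-5, None)                    # guard degenerate channels
    return scale, weight * scale

# The transformed pair is numerically equivalent to the original layer:
x = np.array([[1.0, -30.0, 0.5], [2.0, 1.0, -0.5]])      # channel 1 has an outlier
w = np.array([[0.2, -1.0, 0.4], [0.5, 0.1, -0.3]])
scale, w_s = smooth_weights(np.abs(x).max(axis=0), w)
assert np.allclose(x @ w.T, (x / scale) @ w_s.T)
```

With alpha closer to 1, more of the outlier magnitude moves into the weights; sweeping alpha is exactly the hyper‑parameter search that makes Flex Smooth Quant expensive.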
Key Technical Difficulty: Sensitive Layer Handling
Sensitive‑layer analysis: ran msModelSlim analyze to obtain a sensitivity ranking and identified the 15 most sensitive layers.
Sensitive‑layer fallback: kept those layers (e.g., self_attn.o_proj and mlp.experts.*.down_proj) in floating‑point precision by adding them to the fallback list.
Quantization granularity adjustment: switched activation quantization from per_tensor to per_token, which dramatically improved accuracy.
Sub‑graph mapping supplement: updated the adapter configuration to add sub‑graph mappings for the MoE expert layers, ensuring Iterative Smooth covered every expert.
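Taken together, the steps above amount to a quantization config plus a layer‑name filter. A hypothetical sketch follows; the key names and the "model.layers.*" prefix are illustrative assumptions, not msModelSlim's real schema (only the two fallback layer names come from the text):

```python
import fnmatch

# Hypothetical W4A8 config reflecting the steps above; key names are
# illustrative, not msModelSlim's real configuration schema.
quant_config = {
    "algorithm": "iterative_smooth",
    "weight_bits": 4,
    "activation_bits": 8,
    "act_granularity": "per_token",          # switched from per_tensor
    "fallback_layers": [                     # kept in floating point
        "model.layers.*.self_attn.o_proj",
        "model.layers.*.mlp.experts.*.down_proj",
    ],
}

def is_fallback(layer_name, patterns):
    """True if the layer should stay in floating point."""
    return any(fnmatch.fnmatch(layer_name, p) for p in patterns)

assert is_fallback("model.layers.7.self_attn.o_proj",
                   quant_config["fallback_layers"])
assert not is_fallback("model.layers.7.self_attn.q_proj",
                       quant_config["fallback_layers"])
```

Wildcard matching keeps the fallback list short even when the pattern has to cover dozens of experts per layer.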
Debugging Strategy: Parallel "Horse‑Race" Experiments
Multiple quantization strategies were run in parallel to reduce iteration time. Example configurations:
Strategy A – Iterative Smooth.
Strategy B – Flex Smooth Quant.
Strategy C – Iterative Smooth with varied alpha values.
Each run took about three hours with half‑hour evaluations. A staged validation approach was also used: first verify the floating‑point model’s inference engine, then enable quantization layer by layer to quickly locate regressions.
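The horse‑race loop itself is just launch‑in‑parallel and compare. A runnable sketch with a stubbed evaluation step (in practice each worker would be a full quantize‑plus‑evaluate job on its own devices; the accuracy numbers here are placeholders, not results from the competition):

```python
import concurrent.futures as cf

STRATEGIES = [
    {"name": "A", "algo": "iterative_smooth",  "alpha": 0.50},
    {"name": "B", "algo": "flex_smooth_quant", "alpha": 0.50},
    {"name": "C", "algo": "iterative_smooth",  "alpha": 0.85},
]

def quantize_and_eval(cfg):
    """Stand-in for a multi-hour quantize-then-evaluate run; returns a
    placeholder accuracy so the control flow is runnable."""
    placeholder = {"A": 0.62, "B": 0.58, "C": 0.67}
    return cfg["name"], placeholder[cfg["name"]]

# Launch all strategies concurrently and keep the best scorer.
with cf.ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
    results = dict(pool.map(quantize_and_eval, STRATEGIES))

best = max(results, key=results.get)
```

The same pattern supports the staged validation described above: each worker can first run the floating‑point baseline, then re‑run with quantization enabled one layer group at a time, diffing accuracy between stages to locate regressions.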
Practical Insights for Developers
Sensitive‑layer analysis is essential: use tools (e.g., msModelSlim analyze) to generate data‑driven sensitivity rankings, and protect the top sensitive layers with fallback or mixed precision to preserve most of the accuracy.
Stage‑wise quantization: validate the floating‑point model, then apply quantization, and finally adjust outlier suppression, verifying at each step to pinpoint issues.
Horse‑race mechanism: run multiple quantization variants concurrently to explore a larger solution space within limited time.
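The ranking behind the first insight can be illustrated with a toy metric: quantize each layer's weight in isolation and measure the relative output error against floating point. Real tools such as msModelSlim analyze use richer signals, but the ranking idea is the same (all names below are illustrative):

```python
import numpy as np

def fake_quant(w, bits=8):
    """Symmetric per-tensor round-trip quantization of a weight."""
    step = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / step) * step

def rank_sensitivity(acts, weights, bits=8):
    """Rank layers by relative output error when only that layer's
    weight is quantized; most sensitive layer first."""
    errors = {}
    for name, w in weights.items():
        ref = acts[name] @ w.T
        qout = acts[name] @ fake_quant(w, bits).T
        errors[name] = np.linalg.norm(ref - qout) / np.linalg.norm(ref)
    return sorted(errors, key=errors.get, reverse=True)

# A layer with a weight outlier ranks as more sensitive than a flat one,
# because the outlier inflates the quantization step for every channel.
x = np.ones((2, 4))
w_flat = np.ones((3, 4))
w_outlier = np.ones((3, 4)); w_outlier[0, 0] = 100.0
ranking = rank_sensitivity({"flat": x, "outlier": x},
                           {"flat": w_flat, "outlier": w_outlier})
assert ranking[0] == "outlier"
```

The top of such a ranking is exactly the set of layers worth protecting with fallback or mixed precision.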
Future Work
Zhang plans to extend his quantization research to multimodal Vision‑Language Models (VLMs), aiming to automate tuning and reduce manual effort.
Resources
Magic Community Large‑Model Quantization Section: https://modelers.cn/topics/quantization
Zhang Zhen's GitHub: https://github.com/CarryChang
msModelSlim code repository: https://gitcode.com/Ascend/msmodelslim
msModelSlim documentation: https://msmodelslim.readthedocs.io/
About DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.