How a Champion Quantized a 150 GB MoE Model in Just 4 Hours
Algorithm engineer Zhang Zhen, from a Chinese EV company, quantized the massive Qwen3‑Next‑80B model within a four‑hour competition window. Here he details the end‑to‑end workflow that won the GeekDay challenge: sensitive‑layer analysis, iterative smoothing, fallback strategies, and parallel "horse‑race" debugging.
Challenge Overview
On March 28, the Magic Community and Huawei Ascend hosted an extreme large‑model quantization challenge. Algorithm engineer Zhang Zhen from a Chinese new‑energy vehicle company completed the full quantization pipeline for the Qwen3‑Next‑80B model (≈150 GB, 47 layers, MoE architecture) and passed accuracy verification within four hours, winning the competition.
Model and Environment Preparation
Because downloading the 150 GB model alone takes substantial time, environment setup and model download were completed before the competition. Zhang also studied the model's MoE structure, full‑attention layers, and linear‑attention layers in advance to anticipate quantization difficulties.
Algorithm Selection
After testing several methods, Zhang chose Iterative Smooth because it balances speed and accuracy and can transfer activation scale values across adjacent layers in MoE structures. SmoothQuant performed poorly in the W4A8 scenario, while Flex Smooth Quant achieved higher accuracy at the cost of extensive hyper‑parameter search.
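The smoothing idea shared by SmoothQuant and Iterative Smooth is to migrate activation outliers into the weights through a per‑channel scale, leaving the matmul output mathematically unchanged while making activations easier to quantize. A minimal NumPy sketch of that transform (not msModelSlim's actual API; the function and parameter names are illustrative):

```python
import numpy as np

def smooth_weights(x_absmax, weight, alpha=0.5):
    """Compute a per-input-channel smoothing scale
        s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
    and fold it into the weight. At runtime the activation is divided by
    the same scale, so (X / s) @ (W * s).T == X @ W.T exactly.
    """
    w_absmax = np.abs(weight).max(axis=0)                 # per input channel
    scale = x_absmax ** alpha / np.clip(w_absmax, 1e-8, None) ** (1 - alpha)
    scale = np.clip(scale, 1e-5, None)                    # guard degenerate channels
    return scale, weight * scale

# The transformed pair is numerically equivalent to the original layer:
x = np.array([[1.0, -30.0, 0.5], [2.0, 1.0, -0.5]])      # channel 1 has an outlier
w = np.array([[0.2, -1.0, 0.4], [0.5, 0.1, -0.3]])
scale, w_s = smooth_weights(np.abs(x).max(axis=0), w)
assert np.allclose(x @ w.T, (x / scale) @ w_s.T)
```

With alpha closer to 1, more of the outlier magnitude moves into the weights; sweeping alpha is exactly the hyper‑parameter search that makes Flex Smooth Quant expensive.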
Key Technical Difficulty: Sensitive Layer Handling
Sensitive‑layer analysis: ran msModelSlim analyze to obtain a sensitivity ranking and identified the 15 most sensitive layers.
Sensitive‑layer fallback: kept those layers (e.g., self_attn.o_proj and mlp.experts.*.down_proj) in floating‑point precision by adding them to the fallback list.
Quantization granularity adjustment: switched activation quantization from per_tensor to per_token, which dramatically improved accuracy.
Sub‑graph mapping supplement: updated the adapter configuration to add sub‑graph mappings for the MoE expert layers, ensuring Iterative Smooth covered every expert.
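Taken together, the steps above amount to a quantization config plus a layer‑name filter. A hypothetical sketch follows; the key names and the "model.layers.*" prefix are illustrative assumptions, not msModelSlim's real schema (only the two fallback layer names come from the text):

```python
import fnmatch

# Hypothetical W4A8 config reflecting the steps above; key names are
# illustrative, not msModelSlim's real configuration schema.
quant_config = {
    "algorithm": "iterative_smooth",
    "weight_bits": 4,
    "activation_bits": 8,
    "act_granularity": "per_token",          # switched from per_tensor
    "fallback_layers": [                     # kept in floating point
        "model.layers.*.self_attn.o_proj",
        "model.layers.*.mlp.experts.*.down_proj",
    ],
}

def is_fallback(layer_name, patterns):
    """True if the layer should stay in floating point."""
    return any(fnmatch.fnmatch(layer_name, p) for p in patterns)

assert is_fallback("model.layers.7.self_attn.o_proj",
                   quant_config["fallback_layers"])
assert not is_fallback("model.layers.7.self_attn.q_proj",
                       quant_config["fallback_layers"])
```

Wildcard matching keeps the fallback list short even when the pattern has to cover dozens of experts per layer.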
Debugging Strategy: Parallel "Horse‑Race" Experiments
Multiple quantization strategies were run in parallel to reduce iteration time. Example configurations:
Strategy A – Iterative Smooth.
Strategy B – Flex Smooth Quant.
Strategy C – Iterative Smooth with varied alpha values.
Each run took about three hours with half‑hour evaluations. A staged validation approach was also used: first verify the floating‑point model’s inference engine, then enable quantization layer by layer to quickly locate regressions.
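The horse‑race loop itself is just launch‑in‑parallel and compare. A runnable sketch with a stubbed evaluation step (in practice each worker would be a full quantize‑plus‑evaluate job on its own devices; the accuracy numbers here are placeholders, not results from the competition):

```python
import concurrent.futures as cf

STRATEGIES = [
    {"name": "A", "algo": "iterative_smooth",  "alpha": 0.50},
    {"name": "B", "algo": "flex_smooth_quant", "alpha": 0.50},
    {"name": "C", "algo": "iterative_smooth",  "alpha": 0.85},
]

def quantize_and_eval(cfg):
    """Stand-in for a multi-hour quantize-then-evaluate run; returns a
    placeholder accuracy so the control flow is runnable."""
    placeholder = {"A": 0.62, "B": 0.58, "C": 0.67}
    return cfg["name"], placeholder[cfg["name"]]

# Launch all strategies concurrently and keep the best scorer.
with cf.ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
    results = dict(pool.map(quantize_and_eval, STRATEGIES))

best = max(results, key=results.get)
```

The same pattern supports the staged validation described above: each worker can first run the floating‑point baseline, then re‑run with quantization enabled one layer group at a time, diffing accuracy between stages to locate regressions.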
Practical Insights for Developers
Sensitive‑layer analysis is essential: use tools (e.g., msModelSlim analyze) to generate data‑driven sensitivity rankings, and protect the top sensitive layers with fallback or mixed precision to preserve most of the accuracy.
Stage‑wise quantization: validate the floating‑point model, then apply quantization, and finally adjust outlier suppression, verifying at each step to pinpoint issues.
Horse‑race mechanism: run multiple quantization variants concurrently to explore a larger solution space within limited time.
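The ranking behind the first insight can be illustrated with a toy metric: quantize each layer's weight in isolation and measure the relative output error against floating point. Real tools such as msModelSlim analyze use richer signals, but the ranking idea is the same (all names below are illustrative):

```python
import numpy as np

def fake_quant(w, bits=8):
    """Symmetric per-tensor round-trip quantization of a weight."""
    step = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / step) * step

def rank_sensitivity(acts, weights, bits=8):
    """Rank layers by relative output error when only that layer's
    weight is quantized; most sensitive layer first."""
    errors = {}
    for name, w in weights.items():
        ref = acts[name] @ w.T
        qout = acts[name] @ fake_quant(w, bits).T
        errors[name] = np.linalg.norm(ref - qout) / np.linalg.norm(ref)
    return sorted(errors, key=errors.get, reverse=True)

# A layer with a weight outlier ranks as more sensitive than a flat one,
# because the outlier inflates the quantization step for every channel.
x = np.ones((2, 4))
w_flat = np.ones((3, 4))
w_outlier = np.ones((3, 4)); w_outlier[0, 0] = 100.0
ranking = rank_sensitivity({"flat": x, "outlier": x},
                           {"flat": w_flat, "outlier": w_outlier})
assert ranking[0] == "outlier"
```

The top of such a ranking is exactly the set of layers worth protecting with fallback or mixed precision.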
Future Work
Zhang plans to extend his quantization research to multimodal Vision‑Language Models (VLMs), aiming to automate tuning and reduce manual effort.
Resources
Magic Community Large‑Model Quantization Section: https://modelers.cn/topics/quantization
Zhang Zhen's GitHub: https://github.com/CarryChang
msModelSlim code repository: https://gitcode.com/Ascend/msmodelslim
msModelSlim documentation: https://msmodelslim.readthedocs.io/
About DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.