Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation

The paper proposes a full quantization pipeline for YOLOv6 that combines a re‑parameterization optimizer (RepOpt), partial PTQ, channel‑wise distillation, graph scale merging, and GPU‑offloaded preprocessing, enabling an INT8 model that retains ~42% mAP while delivering a total throughput increase of over 200% and a ~40% QPS gain versus FP16.

Meituan Technology Team

This article presents a comprehensive quantization deployment solution for the Meituan open‑source object detection framework YOLOv6, aiming to accelerate inference while preserving accuracy.

Background and challenges: YOLOv6 is a fast, high‑accuracy 2D detector widely used in industrial vision tasks. Quantization is essential for speed‑up, but the heavy use of re‑parameterization modules in YOLOv6 makes post‑training quantization (PTQ) and quantization‑aware training (QAT) difficult, leading to severe accuracy loss.
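
To see why re‑parameterization hurts quantization, consider a RepVGG‑style block: at inference time its 3×3, 1×1, and identity branches are summed into a single 3×3 kernel. The following is a minimal sketch (ignoring BatchNorm folding, and not YOLOv6's actual code) of how the merged kernel's dynamic range blows up, which is exactly what a single per‑tensor INT8 scale handles poorly:

```python
# RepVGG-style branch fusion: a 3x3 conv, a 1x1 conv, and an identity
# branch are merged into one 3x3 kernel at inference time. The summed
# kernel has a much wider value range than any single branch, making
# per-tensor INT8 quantization of re-parameterized blocks lossy.
import torch
import torch.nn.functional as F

def fuse_branches(w3x3, w1x1, channels):
    """Merge 3x3, 1x1, and identity branches into a single 3x3 kernel."""
    # Pad the 1x1 kernel to 3x3 so the branches can be summed.
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])
    # The identity branch is a 3x3 kernel with 1 at the center of its
    # own channel and 0 elsewhere.
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id

channels = 64
w3x3 = torch.randn(channels, channels, 3, 3) * 0.1
w1x1 = torch.randn(channels, channels, 1, 1) * 0.1
w_fused = fuse_branches(w3x3, w1x1, channels)
# The fused kernel's range is dominated by the identity spikes, so a
# single per-tensor scale wastes most of the INT8 range on outliers.
print(w3x3.abs().max().item(), w_fused.abs().max().item())
```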

Quantization scheme practice: The paper explores two main routes (PTQ and QAT), using the YOLOv6s model as a case study. It introduces a re‑parameterization optimizer (RepOpt) that moves re‑parameterization from the network structure into the optimizer's gradient updates, avoiding the large weight‑distribution variance caused by branch‑fusion operations. RepOpt‑based PTQ reduces the mAP drop from 7.4% to 1.5% (40.9% vs. 42.4% FP32). Partial PTQ further improves accuracy by keeping the most quantization‑sensitive layers in FP32, identified through sensitivity analyses (MSE, SNR, cosine similarity, and mAP‑based metrics), as in the sketch below.
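
The article does not reproduce the sensitivity‑scan code; the sketch below is one plausible implementation of the idea, fake‑quantizing one convolution at a time and scoring it by the SNR of the model output against the FP32 baseline (assuming a model with a single tensor output; MSE, cosine similarity, or mAP could be swapped in as the metric):

```python
# Per-layer sensitivity scan behind partial PTQ: quantize one conv at a
# time, measure how much the output degrades, and rank the layers.
import torch

def fake_quant(w, num_bits=8):
    """Symmetric per-tensor fake quantization of a weight tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def layer_sensitivity(model, calib_batch):
    baseline = model(calib_batch)
    scores = {}
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Conv2d):
            continue
        saved = module.weight.data.clone()
        module.weight.data = fake_quant(saved)   # quantize this layer only
        out = model(calib_batch)
        noise = (out - baseline).pow(2).mean()
        signal = baseline.pow(2).mean()
        scores[name] = 10 * torch.log10(signal / noise).item()  # SNR in dB
        module.weight.data = saved                # restore FP32 weights
    return scores  # lower SNR = more quantization-sensitive

# Layers with the lowest SNR are the candidates to keep in FP32:
# sensitive = sorted(scores, key=scores.get)[:6]
```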

Channel‑wise distillation: To boost QAT performance, a channel‑wise knowledge distillation strategy is applied to the Neck (Rep‑PAN) features, using a self‑distillation setup where the FP32 teacher guides the INT8 student. This adds ~0.3% mAP improvement.
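
The article names channel‑wise distillation but not its exact loss; the sketch below follows the common formulation (Shu et al., "Channel‑wise Knowledge Distillation for Dense Prediction"), where each channel of a neck feature map is turned into a spatial distribution and the INT8 student is pulled toward the FP32 teacher with a temperature‑scaled KL divergence:

```python
# Channel-wise distillation loss on (N, C, H, W) feature maps taken
# from matching neck levels of the teacher and student.
import torch
import torch.nn.functional as F

def channel_wise_distill_loss(feat_s, feat_t, tau=4.0):
    """KL divergence between per-channel spatial distributions."""
    n, c, h, w = feat_s.shape
    # Flatten spatial dims: each channel becomes a distribution over H*W.
    s = F.log_softmax(feat_s.view(n, c, -1) / tau, dim=-1)
    t = F.softmax(feat_t.view(n, c, -1) / tau, dim=-1)
    # KL averaged over batch and channels; tau^2 rescales gradients as
    # in standard temperature-based distillation.
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2) / c

# Usage during QAT: add this loss on each neck output level.
# loss = det_loss + sum(channel_wise_distill_loss(s, t.detach())
#                       for s, t in zip(student_feats, teacher_feats))
```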

Graph optimization: The authors analyze TensorRT graphs and discover that QAT introduces many quantize_scale_node operations that break operator fusion, reducing throughput. By forcing identical quantization scales across divergent branches, they merge these nodes, eliminating extra kernels and restoring performance.
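
A hedged sketch of the scale‑unification trick, assuming NVIDIA's pytorch‑quantization toolkit (whose TensorQuantizer stores its calibrated range in the private `_amax` buffer; the module names in the usage comment are hypothetical). The idea is simply to give every quantizer feeding the same concat/add an identical amax, so all branches emit the same scale and TensorRT can merge the resulting quantize_scale_node ops without breaking its fusions:

```python
# Force one shared quantization scale across divergent branches so
# their Q/DQ nodes become mergeable in the TensorRT graph.
# Quantizers come from pytorch_quantization (quant_nn.TensorQuantizer).

def unify_branch_scales(quantizers):
    """Write the widest calibrated range back into every quantizer."""
    # Take the largest amax among the branches...
    shared_amax = max(q._amax.max().item() for q in quantizers)
    # ...and assign it to all of them, so every branch quantizes with
    # the same scale and no extra rescale kernels are inserted.
    for q in quantizers:
        q._amax.fill_(shared_amax)

# Usage (hypothetical names for the two inputs of a neck concat):
# unify_branch_scales([model.neck.reduce_conv._input_quantizer,
#                      model.neck.upsample_conv._input_quantizer])
```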

Online service optimization: CPU‑bound preprocessing is offloaded to GPU using NVIDIA DALI, raising INT8 throughput from 552 to 1182 images/s and fully utilizing the T4 GPU.
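
As a concrete illustration, a minimal DALI pipeline that moves JPEG decode, resize, and normalization onto the GPU might look like the following (the directory path, batch size, and plain 640×640 resize are assumptions; YOLOv6's real preprocessing uses aspect‑preserving letterbox padding):

```python
# GPU-side preprocessing with NVIDIA DALI: decode, resize, and
# normalize batches on the GPU so the CPU stops being the bottleneck.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def preprocess_pipeline(image_dir):
    jpegs, labels = fn.readers.file(file_root=image_dir, name="reader")
    # "mixed" decodes JPEGs with the GPU (nvJPEG) where possible.
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=640, resize_y=640)
    # Scale pixels to [0, 1] and emit CHW float tensors for inference.
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return images

# pipe = preprocess_pipeline("/data/coco/val2017")
# pipe.build()
# (batch,) = pipe.run()  # batch already lives on the GPU for TensorRT
```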

Results: After applying RepOpt, partial quantization, channel‑wise distillation, graph scale merging, and DALI preprocessing, the final INT8 YOLOv6s model achieves ~42% mAP (a ~40% QPS gain over FP16) and a total throughput increase of over 200% in industrial deployment scenarios.

The paper concludes that the proposed quantization pipeline provides a practical, high‑performance path for deploying 2D detection models at scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance Optimization, Quantization, Model Deployment, QAT, YOLOv6, Channel Distillation, PTQ
Written by

Meituan Technology Team

Over 10,000 engineers powering China's leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
