Tagged articles
120 articles
Meituan Technology Team
Sep 22, 2022 · Artificial Intelligence

Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation

The paper proposes a full quantization pipeline for YOLOv6 that combines a re‑parameterization optimizer, partial PTQ, channel‑wise distillation, graph‑scale merging, and GPU‑offloaded preprocessing, enabling an INT8 model to retain ~42% mAP while delivering a more than 200% throughput increase and a 40% QPS gain versus FP16.

Channel Distillation · Model Deployment · PTQ
0 likes · 16 min read
Kuaishou Large Model
Jul 29, 2022 · Fundamentals

How Automatic Quantization Slashes Memory Use in High‑Resolution Physical Simulations

This article explains how researchers applied quantization techniques to high‑resolution physical simulations, enabling over 50% memory reduction without noticeable visual loss, by modeling error propagation, using constrained optimization, and introducing dithering, with results demonstrated on GPU‑based smoke, fluid, and elastic body simulations.

GPU memory optimization · Physical Simulation · SIGGRAPH
0 likes · 6 min read
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Deep Learning · Dynamic Batching · Inference Acceleration
0 likes · 12 min read
Code DAO
May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNN · PyTorch · edge inference
0 likes · 11 min read
DataFunTalk
Apr 22, 2022 · Artificial Intelligence

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

This article presents a comprehensive overview of inference optimization methods—including model pruning, quantization, knowledge distillation, caching, instruction‑set acceleration, and operator fusion—and details a GPU‑centric parallel acceleration methodology with CUDA basics, performance‑analysis tools, theoretical limits, and practical case studies, all illustrated with real‑world examples from Tencent's intelligent dialogue products.

GPU Acceleration · Operator fusion · caching
0 likes · 18 min read
DataFunSummit
Jan 29, 2022 · Artificial Intelligence

Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression, focusing on pruning methods—including unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise approaches—and quantization techniques such as linear, non‑linear, clustering, power‑of‑two, binary, and 8‑bit quantization, while discussing evaluation criteria, sparsity ratios, fine‑tuning, and quantization‑aware training.

Deep Learning · Neural Networks · model compression
0 likes · 23 min read
Laiye Technology Team
Jan 28, 2022 · Artificial Intelligence

Survey of Model Compression and Quantization Techniques for Deep Neural Networks

This article provides a comprehensive overview of deep learning model compression and acceleration methods, detailing pruning strategies, various pruning types, evaluation criteria, sparsity ratios, fine‑tuning procedures, as well as linear and non‑linear quantization approaches, their implementations, and practical considerations.

Deep Learning · Neural Networks · efficiency
0 likes · 26 min read
DataFunSummit
Jun 5, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods

This article reviews BERT’s architecture, analyzes the storage and compute costs of each layer, and systematically presents compression methods—including quantization, pruning, knowledge distillation (Distilled BiLSTM and MobileBERT), and structure‑preserving techniques—aimed at enabling efficient deployment on resource‑constrained mobile devices.

BERT · Mobile Deployment · knowledge distillation
0 likes · 15 min read
DataFunTalk
Jun 3, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure-Preserving Methods

This article examines the internal structure of BERT and systematically presents various model‑compression strategies—including quantization, pruning, knowledge distillation, and structure‑preserving techniques—highlighting their impact on storage, computational cost, and inference speed for deployment on resource‑constrained mobile devices.

BERT · Mobile AI · knowledge distillation
0 likes · 16 min read
Kuaishou Tech
Mar 18, 2021 · Artificial Intelligence

Hammer: An Integrated Hardware-Aware Model Compression Framework

Hammer is an integrated hardware-aware model compression tool developed by Kuaishou in collaboration with universities, combining pruning, quantization, search, and distillation to achieve efficient and accurate neural network models tailored to specific hardware.

AI Framework · Kuaishou · NAS
0 likes · 9 min read
Sohu Tech Products
Jan 6, 2021 · Artificial Intelligence

Overview of Main Model Compression and Acceleration Techniques: Structural Optimization, Pruning, Quantization, and Knowledge Distillation

This article reviews four mainstream model compression and acceleration methods—structural optimization, pruning, quantization, and knowledge distillation—explaining their principles, implementations, and performance, and presents practical examples such as DistilBERT, TinyBERT, and FastBERT with comparative results.

AI · Deep Learning · knowledge distillation
0 likes · 14 min read
Didi Tech
Oct 21, 2020 · Artificial Intelligence

Deep Model Compression Techniques for Intelligent Automotive Cockpits

The article reviews deep‑model compression methods—ADMM‑based structured pruning, low‑bit quantization, and teacher‑student knowledge distillation—and their automated AutoCompress workflow, demonstrating how these techniques shrink neural networks enough to run real‑time driver‑monitoring and other intelligent cockpit functions on resource‑limited automotive hardware while preserving accuracy.

ADMM · Deep Learning · edge AI
0 likes · 16 min read
AntTech
Jun 9, 2020 · Artificial Intelligence

Deep Learning Model Compression and Acceleration Techniques for Mobile AI

This article reviews the motivations, challenges, and a comprehensive set of algorithmic, framework, and hardware methods—including structural optimization, quantization, pruning, and knowledge distillation—to compress and accelerate deep learning models for deployment on mobile devices, highlighting benefits such as reduced server load, lower latency, improved reliability, and enhanced privacy.

Mobile AI · knowledge distillation · model compression
0 likes · 17 min read
Tencent Tech
Feb 27, 2020 · Artificial Intelligence

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Deep Learning · Distributed Training · knowledge distillation
0 likes · 14 min read
Alibaba Cloud Developer
May 21, 2019 · Artificial Intelligence

How Alibaba’s Offline AI Advances Model Compression and Edge Inference

Alibaba’s Machine Intelligence Lab shares two years of breakthroughs in offline AI, detailing low‑bit quantization, unified sparsity frameworks, hardware‑software co‑design, lightweight networks, and on‑device detection, alongside standardized training tools, multi‑platform inference engines, and productized edge solutions such as smart boxes and integrated cameras.

AI · edge inference · hardware-software co-design
0 likes · 16 min read
Hulu Beijing
Apr 30, 2019 · Artificial Intelligence

How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained

This article reviews why deep neural networks are over‑parameterized, outlines the challenges of deploying them on mobile and embedded devices, and presents six major strategies—pruning, low‑rank approximation, filter selection, quantization, knowledge distillation, and novel architecture design—to accelerate and compress models while preserving performance.

Deep Learning · knowledge distillation · model acceleration
0 likes · 11 min read
Tencent Architect
Nov 13, 2017 · Artificial Intelligence

Survey of Bandwidth Optimization Techniques in AI Accelerators

This article reviews various architectural strategies—including streaming processing, on‑chip memory optimization, bit‑width compression, sparsity techniques, on‑chip models with chip‑level interconnects, and emerging technologies such as binary networks, memristors, and HBM—to alleviate bandwidth bottlenecks in FPGA/ASIC/TPU AI accelerators.

AI · ASIC · Accelerators
0 likes · 20 min read