How Quantization and Fusion Accelerate CNN Inference on Edge Devices

This article walks through CNN inference optimization using PyTorch quantization and module-fusion techniques: it shows code for building, quantizing, and fusing a simple CNN, compares model size and latency before and after quantization, and presents CPU benchmark results, highlighting a four-fold size reduction and up to a 1.7× speed-up.

What is Quantization?

Quantization is a simple technique to accelerate deep‑learning models during inference by compressing 32‑bit floating‑point parameters to 8‑bit integers, reducing model size and memory demand roughly four‑fold, at the cost of some accuracy.

It requires a mapping function from floating‑point to integer, typically a linear transform Q(r) = round(r / S) + Z, where S is the scale factor and Z is the zero‑point. Calibration finds S and Z, often using the min‑max of activation ranges or more advanced methods such as MSE or entropy minimization.
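
To make the mapping concrete, here is a minimal sketch of affine min-max calibration and quantization in plain PyTorch; the example tensor and the unsigned 8-bit target range are assumptions for illustration:

import torch

def minmax_qparams(x, q_min=0, q_max=255):
    # affine (asymmetric) min-max calibration for an unsigned 8-bit range
    r_min, r_max = x.min().item(), x.max().item()
    S = (r_max - r_min) / (q_max - q_min)      # scale factor
    Z = int(round(q_min - r_min / S))          # zero-point
    return S, Z

def quantize(x, S, Z, q_min=0, q_max=255):
    # Q(r) = round(r / S) + Z, clamped to the integer range
    return torch.clamp(torch.round(x / S) + Z, q_min, q_max).to(torch.uint8)

x = torch.randn(4, 4)            # example tensor
S, Z = minmax_qparams(x)
x_q = quantize(x, S, Z)
x_hat = (x_q.float() - Z) * S    # de-quantized approximation of x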

PyTorch provides an Observer module that collects statistics to compute S and Z, supporting per‑tensor or per‑channel calibration.
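
For example, a small sketch using the built-in MinMaxObserver (the calibration tensors here are random stand-ins):

import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
for _ in range(10):
    obs(torch.randn(32, 64))     # feed calibration batches; the observer tracks min/max
scale, zero_point = obs.calculate_qparams()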

Quantization Techniques

Dynamic Quantization

Weights are quantized ahead of time, while activations are quantized on the fly at inference. Because activation ranges are computed dynamically per batch, accuracy holds up well; this mode is recommended for LSTM, GRU, and other RNN models.
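
In PyTorch this is a single call; a minimal sketch with a stand-in LSTM (the model and layer choice are illustrative):

import torch
import torch.nn as nn

float_model = nn.LSTM(input_size=16, hidden_size=32)    # stand-in recurrent model
# weights are converted to int8 ahead of time; activations are quantized on the fly
model_int8 = torch.quantization.quantize_dynamic(
    float_model, {nn.LSTM}, dtype=torch.qint8
)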

Post‑Training Static Quantization

Both weights and activations are quantized before inference, with calibration run on a representative validation set. Inference is faster than with dynamic quantization, but the model may need periodic recalibration to stay robust as input distributions drift.
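
The calibration pass itself is just inference over representative data while observers record ranges; a minimal sketch, assuming a prepared model and a hypothetical calibration_loader:

import torch

model_prepared.eval()                       # model after torch.quantization.prepare(...)
with torch.no_grad():
    for images, _ in calibration_loader:    # hypothetical validation-set loader
        model_prepared(images)              # observers record activation ranges
model_int8 = torch.quantization.convert(model_prepared)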

Quantization‑Aware Training (QAT)

Simulates quantization during training (fake-quantization in the forward pass) so the quantization error enters the training loss and the network learns to compensate; scale and zero-point parameters are learned along the way. This article applies post-training quantization to a CNN.
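
Although this article sticks to post-training quantization, a QAT workflow looks roughly like the following sketch (model, loader, optimizer, and criterion are assumed to exist):

import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model_prepared = torch.quantization.prepare_qat(model)    # inserts fake-quant modules

for images, labels in train_loader:                       # hypothetical training loader
    optimizer.zero_grad()
    loss = criterion(model_prepared(images), labels)      # loss sees simulated quantization error
    loss.backward()
    optimizer.step()

model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)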

What is Module Fusion?

Fusion merges consecutive layers (e.g., Conv+ReLU, Conv+BatchNorm, Conv+BatchNorm+ReLU, Linear+ReLU) into a single operation, reducing memory accesses and inference time. The trade‑off is reduced debuggability, and fusion only applies to specific layer patterns.
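
To see why fusion helps, consider Conv+BatchNorm: at inference time, BatchNorm's per-channel affine transform can be folded into the convolution's weights and bias, so a single kernel replaces two passes over memory. A minimal sketch of the folding arithmetic (illustrative only, not PyTorch's internal implementation):

import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # w' = gamma * w / sqrt(var + eps);  b' = gamma * (b - mean) / sqrt(var + eps) + beta
    std = torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(bn.weight * (bias - bn.running_mean) / std + bn.bias)
    return fused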

PyTorch Implementation and Comparison

First, a simple CNN is defined with two convolutional blocks, two fully-connected layers, and a Softmax output. With 3×224×224 inputs, the second pooling layer emits a 50×53×53 feature map, which is flattened into fc1.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # block 1: conv -> ReLU -> 2x2 max-pool
        self.conv1 = nn.Conv2d(3, 20, kernel_size=(5, 5))
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # block 2: conv -> ReLU -> 2x2 max-pool
        self.conv2 = nn.Conv2d(20, 50, kernel_size=(5, 5))
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # classifier head: 50*53*53 is the flattened feature map for a 224x224 input
        self.fc1 = nn.Linear(50 * 53 * 53, 500)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(500, 10)
        self.Softmax = nn.Softmax(1)

    def forward(self, x):
        x = self.conv1(x); x = self.relu1(x); x = self.maxpool1(x)
        x = self.conv2(x); x = self.relu2(x); x = self.maxpool2(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x); x = self.relu3(x)
        x = self.fc2(x)
        return self.Softmax(x)

The model summary shows about 70 million parameters and an estimated size of 294 MB.
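
The count is easy to verify; nearly all of the parameters sit in fc1, whose input is the flattened 50×53×53 feature map:

net = Net()
n_params = sum(p.numel() for p in net.parameters())
print(f"{n_params:,} parameters")                        # roughly 70 million, dominated by fc1
print(f"~{n_params * 4 / 1e6:.0f} MB of FP32 weights")   # 4 bytes per FP32 parameter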

To create a quantizable version, two new modules (QuantStub and DeQuantStub) are added; the forward pass quantizes the input at entry and de-quantizes just before the final Softmax.

class NetQuant(nn.Module):
    def __init__(self):
        super(NetQuant, self).__init__()
        self.quant = torch.quantization.QuantStub()
        # same layers as Net …
        self.dequant = torch.quantization.DeQuantStub()
    def forward(self, x):
        x = self.quant(x)
        # same forward as Net …
        x = self.fc2(x)
        x = self.dequant(x)
        x = self.Softmax(x)
        return x

Quantization configuration uses the “fbgemm” backend for x86 CPUs (or “qnnpack” for ARM). The workflow is:

net = Net(); net.eval()
net_quant = NetQuant(); net_quant.eval()
net_quant.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.backends.quantized.engine = "fbgemm"
net_quant = torch.quantization.prepare(net_quant.cpu(), inplace=False)
# calibration belongs here: run representative inputs through the prepared model
# so the observers can record activation ranges (a random stand-in is used below)
net_quant(torch.randn(32, 3, 224, 224))
net_quant = torch.quantization.convert(net_quant, inplace=False)

Model size measurement shows the quantized model is roughly four times smaller than the original.
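
The article does not show its measurement code; a common approach is to serialize each state_dict and compare file sizes:

import os
import torch

def model_size_mb(model, path="tmp_weights.pt"):
    torch.save(model.state_dict(), path)     # serialize parameters to disk
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32: {model_size_mb(net):.1f} MB")
print(f"INT8: {model_size_mb(net_quant):.1f} MB")   # roughly 4x smaller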

Latency testing on CPU with a batch of 32 inputs shows the FP32 model takes 162 ms, while the INT8 quantized model is about 1.7× faster.
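
A simple way to reproduce the timing is a warm-up followed by averaged forward passes; the iteration counts below are arbitrary:

import time
import torch

def benchmark(model, batch, warmup=5, iters=20):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):              # warm-up runs exclude one-time setup costs
            model(batch)
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
    return (time.perf_counter() - start) / iters * 1000   # ms per batch

batch = torch.randn(32, 3, 224, 224)         # batch of 32 RGB 224x224 inputs
print(f"FP32: {benchmark(net, batch):.0f} ms")
print(f"INT8: {benchmark(net_quant, batch):.0f} ms")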

Fusion is then applied using the layer groups ['conv1','relu1'], ['conv2','relu2'], ['fc1','relu3']. Note that fuse_modules operates on float modules in eval mode, so the quantized variant should be fused before the prepare/convert workflow rather than after conversion:

modules_to_fuse = [['conv1', 'relu1'], ['conv2', 'relu2'], ['fc1', 'relu3']]
net_fused = torch.quantization.fuse_modules(net, modules_to_fuse)    # fused FP32 model
net_quant_fused = torch.quantization.fuse_modules(NetQuant().eval(), modules_to_fuse)
# net_quant_fused then goes through the same qconfig/prepare/calibrate/convert steps

Latency measurements indicate that fused models (with or without quantization) gain additional speed, and fusion itself adds no meaningful accuracy loss beyond that of quantization alone.


Tags: CNN, model compression, quantization, PyTorch, edge inference, performance benchmarking, module fusion
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
