
Performance Optimization of Depthwise Conv Int8 on ARM CPUs

By converting the input format to a C16 layout and exploiting the ARM V8.2 Sdot instruction, the Int8 depthwise‑convolution operator on ARM CPUs can be accelerated from 4.46 ms to 1.75 ms—a 2.5× speedup—though the required data‑rearrangement overhead prevents it from overtaking FP16 performance.

DaTaobao Tech

This article presents performance optimization techniques for the Int8 depthwise convolution operator on mobile CPUs. Upgrades in ARM architecture and new instruction sets such as Sdot, combined with data layout rearrangement, can bring substantial speedups.

Background : In MNN, the ConvolutionDepthwise Int8 operator (referred to as DepthwiseConvInt8) is significantly slower than FP16 inference on Android devices—up to three times longer—affecting the efficiency of quantized models on the edge.

ARM V8 data layout optimization : The original input format uses a C4 layout (four channels interleaved). By switching to a C16 layout (sixteen channels interleaved), sixteen Int8 values can fill a vector register, improving parallelism. The assembly code below shows the difference:

/* x0: source address, read 4 points */
/* pack=4, stridex = 2, sizeof(inputData) = 1 -> step = 4*2*1 = 8 bytes */
/* the post-index step differs from the transfer size, so it must be a register */
mov x1, #8
ld1 {v0.s}[0], [x0], x1
ld1 {v1.s}[0], [x0], x1
ld1 {v2.s}[0], [x0], x1
ld1 {v3.s}[0], [x0], x1
/* pack=16, stridex = 2, sizeof(inputData) = 1 -> step = 16*2*1 = 32 bytes */
mov x1, #32
ld1 {v0.16b}, [x0], x1
ld1 {v1.16b}, [x0], x1
ld1 {v2.16b}, [x0], x1
ld1 {v3.16b}, [x0], x1
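To make the layout change concrete, here is a minimal C sketch of the C4-to-C16 repacking. The function and parameter names are illustrative, not MNN's actual API, and the channel count is assumed to be a multiple of 16.

```c
#include <stdint.h>
#include <string.h>

/* src: channel-packed in groups of 4 (C/4 planes of H*W*4 bytes)
   dst: channel-packed in groups of 16 (C/16 planes of H*W*16 bytes)
   channels is assumed to be a multiple of 16. */
void repack_c4_to_c16(const int8_t *src, int8_t *dst,
                      int channels, int height, int width) {
    int area = height * width;
    for (int c = 0; c < channels; c += 16) {
        for (int p = 0; p < area; ++p) {
            for (int g = 0; g < 4; ++g) {      /* four C4 groups per C16 block */
                int c4 = (c + 4 * g) / 4;      /* index of the source C4 plane */
                const int8_t *s = src + (c4 * area + p) * 4;
                int8_t *d = dst + ((c / 16) * area + p) * 16 + 4 * g;
                memcpy(d, s, 4);               /* move one group of 4 channels */
            }
        }
    }
}
```

After repacking, each spatial point holds 16 contiguous Int8 channel values, exactly filling one 128-bit vector register per load.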

On a Huawei Mate40 Pro (ARM V8), the latency of the DepthwiseConvInt8 operator drops from 4.46 ms to 2.78 ms after applying the C16 layout, a 1.6× speedup.

ARM V8.2 Sdot optimization : The Sdot instruction computes a four-way Int8 dot product with accumulation in each 32-bit lane, so a single instruction performs sixteen multiply‑accumulates across the vector. For a 3×3 kernel, two Sdot instructions cover eight of the nine taps and one Smlal handles the remainder, reducing the instruction count dramatically. Example assembly:

// 3x3 kernel without sdot: 9 iterations, one multiply-accumulate each
Loop_Kernel_H3:
  Loop_Kernel_W3:
    smlal v0.4s, v1.4h, v2.4h
// 3x3 kernel with sdot: taps 0-7 in two instructions, tap 8 via smlal
sdot v0.4s, v1.16b, v3.16b
sdot v0.4s, v2.16b, v4.16b
smlal v0.4s, v5.4h, v6.4h
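The per-lane semantics of Sdot can be written out as a scalar C reference (an illustrative model, not an ARM intrinsic): each 32-bit accumulator lane gains the dot product of four Int8 pairs.

```c
#include <stdint.h>

/* Scalar reference of one sdot: for each of the four 32-bit lanes of the
   accumulator, add the dot product of four Int8 pairs from a and b. */
void sdot_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; ++lane)
        for (int i = 0; i < 4; ++i)
            acc[lane] += (int32_t)a[4*lane + i] * (int32_t)b[4*lane + i];
}
```

In the depthwise case each lane corresponds to one channel, and the four bytes feeding it are four kernel taps of that channel, which is why the two Sdot instructions above absorb eight of the nine 3×3 taps.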

Using Sdot requires additional data rearrangement: the input must be reordered so that groups of four Int8 values—four consecutive kernel taps of the same channel—sit in contiguous bytes. This rearrangement is performed with TBL instructions and is essential for the 3×3 kernel.
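A minimal C sketch of the gather that the TBL instructions perform, under an assumed layout where four kernel taps of each channel become four contiguous bytes, so that each Sdot lane accumulates one channel; the names and offsets are illustrative, not MNN's actual code.

```c
#include <stdint.h>

/* Gather four kernel taps per channel into contiguous bytes.
   src points at C16-packed input; tap_offset[t] is the byte offset of tap t.
   dst is one 16-byte vector: 4 channels x 4 taps, tap-contiguous per channel. */
void rearrange_for_sdot(int8_t dst[16], const int8_t *src,
                        const int tap_offset[4]) {
    for (int c = 0; c < 4; ++c)        /* four channels -> four sdot lanes */
        for (int t = 0; t < 4; ++t)
            dst[4*c + t] = src[tap_offset[t] + c];
}
```

TBL performs this byte permutation sixteen lanes at a time in a single instruction, which is what keeps the rearrangement cheap enough for Sdot to pay off.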

Performance results on the same device show further improvement: the C16 layout alone brings the latency to 2.78 ms, and adding Sdot reduces it to 1.75 ms, a 2.55× overall speedup over the original 4.46 ms.

Conclusion : ARM V8.2 optimizations—data layout conversion to C16 and the use of Sdot—significantly accelerate DepthwiseConvInt8, bringing its performance close to the theoretical limit. However, the overhead of data rearrangement prevents Int8 quantized models from surpassing half‑precision (FP16) inference speed.

Written by DaTaobao Tech, the official account of DaTaobao Technology.