DaTaobao Tech
Nov 24, 2023 · Artificial Intelligence
Performance Optimization of Depthwise Conv Int8 on ARM CPUs
By converting the input to a C16 layout and using the ARMv8.2 SDOT instruction, the Int8 depthwise-convolution operator on ARM CPUs is accelerated from 4.46 ms to 1.75 ms, roughly a 2.5× speedup. The data-rearrangement overhead this requires, however, still keeps it from overtaking FP16 performance.
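To make the two ingredients concrete, here is a minimal portable C sketch rather than the article's actual kernel. `sdot_lane` is a scalar model of what SDOT computes for one 32-bit lane (four int8 products accumulated into an int32). `pack_c16` repacks a tensor into blocks of 16 channels. The function names, the planar `[channels][area]` source layout, and the zero-padding of the last block are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PACK 16 /* channels per C16 block */

/* Scalar model of one 32-bit lane of SDOT:
 * acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3], with int8 inputs. */
static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Repack an assumed planar layout [channels][area] into C16 blocks
 * [ceil(channels/16)][area][16]. Unused channel slots in the last block
 * are zero-filled. dst must hold ceil(channels/16) * area * 16 bytes. */
static void pack_c16(const int8_t *src, int8_t *dst,
                     size_t channels, size_t area) {
    assert(channels > 0 && area > 0);
    size_t blocks = (channels + PACK - 1) / PACK;
    memset(dst, 0, blocks * area * PACK);
    for (size_t c = 0; c < channels; ++c) {
        for (size_t p = 0; p < area; ++p) {
            dst[((c / PACK) * area + p) * PACK + (c % PACK)] = src[c * area + p];
        }
    }
}
```

A repacking pass like `pack_c16` touches every input byte once before any arithmetic happens, which is the kind of rearrangement cost the summary says limits the overall gain.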
ARM · Depthwise Convolution · Int8
10 min read