Can Frequency‑Domain Learning Boost Image Inference Efficiency?
This article presents a system‑level approach that performs deep‑learning inference directly on JPEG frequency components, uses a gating mechanism to select important DCT coefficients, and demonstrates higher accuracy with far lower bandwidth for image classification and instance segmentation tasks.
1. Basic Framework of Image Transmission/Storage/Analysis System
Modern computer‑vision pipelines process RGB images through capture, compression, transmission, decompression and inference. In real‑time systems the compression/decompression stages can dominate latency and power consumption, especially the DCT and IDCT modules.
Figure 1 shows that image processing time can be twice that of inference in a GPU‑based system.
2. Machine Learning in the Frequency Domain
2.1 Using Frequency Information for Learning
Each 8×8 DCT block yields 64 coefficients per channel (Y, Cb, Cr), producing 192 feature maps (56×56×192 for a 448×448 image). These maps can be fed directly into existing CNNs such as ResNet‑50 without spatial‑domain conversion.
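The reshaping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the function names `dct_matrix` and `blockwise_dct_features` are hypothetical, the input is assumed to already be a 448×448 YCbCr array, and JPEG details such as level shifting, quantization and chroma subsampling are left out.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m

def blockwise_dct_features(ycbcr, block=8):
    """Map an (H, W, 3) YCbCr image to (H/8, W/8, 192) frequency feature maps.

    Coefficients with the same frequency index are grouped into one map,
    so each colour channel contributes block*block = 64 maps.
    """
    h, w, c = ycbcr.shape
    assert h % block == 0 and w % block == 0, "image must tile into blocks"
    d = dct_matrix(block)
    # (H, W, C) -> (H/b, W/b, C, b, b): one b x b tile per block and channel
    tiles = ycbcr.reshape(h // block, block, w // block, block, c)
    tiles = tiles.transpose(0, 2, 4, 1, 3)
    # 2-D DCT of every tile; matmul broadcasts over the leading dimensions
    coeffs = d @ tiles @ d.T
    # Flatten (C, b, b) into 192 maps ordered Y[0..63], Cb[0..63], Cr[0..63]
    return coeffs.reshape(h // block, w // block, c * block * block)

features = blockwise_dct_features(np.ones((448, 448, 3)))
print(features.shape)  # (56, 56, 192)
```

The resulting tensor has the 56×56×192 shape quoted above, which is why it can stand in for the activations a CNN would normally produce after its early downsampling layers.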
Figure 10(a) illustrates the preprocessing pipeline; Figure 10(b) shows how the frequency‑domain feature maps are attached to the first Residual Block of ResNet‑50.
2.2 Selecting Important Frequency Components
A gating network learns the importance of each of the 192 feature maps. After average pooling each map to a scalar, a fully‑connected layer produces two scores per map, one for keeping it and one for dropping it; the map is kept when its keep score is the larger of the two. Because this selection is discrete, the gate is trained with a Gumbel‑softmax estimator so that gradients can still flow through it.
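The forward pass of such a gate might look like the NumPy sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the name `gumbel_softmax_gate`, the dense weight of shape (C, C, 2), and the temperature `tau` are all hypothetical, and plain NumPy has no automatic differentiation, so only the sampling and masking are shown. In a deep-learning framework the soft probabilities would carry the gradient while the hard decision is used in the forward pass.

```python
import numpy as np

def gumbel_softmax_gate(features, weight, bias, tau=1.0, rng=None):
    """Sample a keep/drop decision per frequency map.

    features: (H, W, C) frequency feature maps
    weight:   (C, C, 2) hypothetical fully-connected weights
    bias:     (C, 2), column 0 = drop score, column 1 = keep score
    Returns (gated features, boolean keep mask, soft probabilities).
    """
    rng = np.random.default_rng() if rng is None else rng
    pooled = features.mean(axis=(0, 1))                      # (C,) one scalar per map
    logits = np.einsum("c,cko->ko", pooled, weight) + bias   # (C, 2) scores per map
    # Gumbel(0, 1) noise; clipping keeps both logarithms finite
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z -= z.max(axis=1, keepdims=True)                        # numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    keep = soft[:, 1] > soft[:, 0]                           # hard decision per map
    return features * keep, keep, soft
```

Maps whose decision is "drop" are zeroed here for simplicity; in a deployed system they would simply not be transmitted.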
Figure 11 visualises the gating mechanism.
Two selection strategies are explored:
Dynamic: the gate decides per input which frequencies to transmit, reducing bandwidth on a per‑image basis.
Static: a fixed subset of frequencies is chosen after training, allowing the encoder to omit unimportant coefficients entirely.
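One plausible way to turn per‑image gate decisions into a static subset is to count how often each map is kept over a dataset and retain the most frequently kept ones. The sketch below assumes exactly that; the text does not spell out the selection rule, and the function name `static_subset` and its inputs are hypothetical.

```python
import numpy as np

def static_subset(keep_masks, k):
    """Pick the k frequency maps that the gate keeps most often.

    keep_masks: (N, C) boolean array, one gate decision per image and map
    Returns the selected map indices in ascending order.
    """
    keep_rate = keep_masks.mean(axis=0)                # fraction of images keeping each map
    top = np.argsort(-keep_rate, kind="stable")[:k]    # highest keep rate first
    return np.sort(top)
```

With a fixed index set like this, the encoder could skip the remaining coefficients for every image instead of deciding per input.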
3. Experimental Results
3.1 Image Classification
ResNet‑50 and MobileNetV2 were evaluated on ImageNet. Selecting only 24 out of 192 frequency maps (14 Y, 5 Cb, 5 Cr) reduced transmission bandwidth to one‑eighth while improving top‑1 accuracy from 75.78 % to 77.20 % for ResNet‑50 and from 71.70 % to 72.36 % for MobileNetV2.
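The one‑eighth figure follows directly from the map counts, as the short check below shows. It assumes every frequency map costs the same to transmit, which is a simplification; the variable names are illustrative only.

```python
# Frequency maps kept per colour channel, as reported above
selected = {"Y": 14, "Cb": 5, "Cr": 5}

total_maps = 3 * 64                      # 64 DCT coefficients for each of Y, Cb, Cr
kept_maps = sum(selected.values())       # 14 + 5 + 5
bandwidth_ratio = kept_maps / total_maps

# Top-1 accuracy change in percentage points, from the reported figures
resnet50_gain = round(77.20 - 75.78, 2)
mobilenetv2_gain = round(72.36 - 71.70, 2)

print(kept_maps, total_maps, bandwidth_ratio)
print(resnet50_gain, mobilenetv2_gain)
```

So the input tensor shrinks from 56×56×192 to 56×56×24 while top‑1 accuracy rises by 1.42 and 0.66 percentage points respectively.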
Figure 13 shows a heat‑map of frequency importance for classification.
3.2 Instance Segmentation
Mask R‑CNN was trained on COCO. Using the same frequency‑selection strategy increased mask‑AP from 34.2 % to 35.0 % and object‑detection AP from 37.3 % to 38.1 %.
Figure 15 provides a visual example of instance segmentation using selected frequency components.
4. Future Work and Discussion
The current dynamic approach avoids modifying the encoder, but a static scheme could further reduce encoding cost and transmission bandwidth. Extending the method to video compression, where inter‑frame prediction changes frequency characteristics, is a promising direction. Ultimately, the work asks whether machine‑learning‑friendly features can be designed that discard spatial‑domain redundancy and so save bandwidth between the decoder and the AI engine.
Acknowledgement
The work was conducted by Kai Xu, Minghai Qin, Yuhao Wang, Fei Sun, Chao Cheng, Yen‑kuang Chen, and Yuan Xie at Alibaba DAMO Academy, with contributions from Prof. Fengbo Ren (Arizona State University).
