How to Engineer MobileNet for Efficient Image Classification on Mobile Devices
This article details the engineering of MobileNet V1 for image classification on mobile devices, covering its depthwise separable convolution architecture, data collection and preprocessing, model training with transfer learning, TensorFlow Lite conversion, deployment on iOS and Android, and GPU acceleration techniques for faster inference.
MobileNet Model
MobileNet is a lightweight deep neural network designed by Google for mobile devices, offering low parameter count and computational cost while achieving strong performance on image classification and object detection tasks. The article uses MobileNet V1 for image classification.
MobileNet employs depthwise separable convolutions, splitting a standard convolution into a depthwise convolution and a pointwise (1x1) convolution, dramatically reducing parameters and computation.
With a DK×DK kernel, M input channels, and N output channels, a depthwise separable convolution has DK×DK×M + M×N parameters, far fewer than the DK×DK×M×N of a standard convolution, while accuracy drops by only about 1%.
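As a quick check of the parameter formulas above, here is a sketch in C; the layer shape (3×3 kernel, 32 input channels, 64 output channels) is an illustrative choice, not a specific layer from the paper:

```c
/* Parameters of a standard DK x DK convolution with M input, N output channels. */
long standard_params(int dk, int m, int n) {
    return (long)dk * dk * m * n;
}

/* Parameters of the equivalent depthwise separable convolution:
   depthwise DK x DK x M plus pointwise 1x1 x M x N. */
long separable_params(int dk, int m, int n) {
    return (long)dk * dk * m + (long)m * n;
}

/* Example: standard_params(3, 32, 64) == 18432,
   separable_params(3, 32, 64) == 288 + 2048 == 2336,
   roughly a 7.9x reduction for this layer. */
```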
Depthwise Convolution Example (C)
The depthwise stage filters each input channel with its own kernel; the pointwise stage is then an ordinary 1×1 convolution. In plain C (a fragment: malloc needs stdlib.h, fmax/fmin need math.h):
int output_count = output_width * output_height * input_channel;
float *output = (float *)malloc(output_count * sizeof(float));
int index = 0;
for (int i = 0; i < output_height; ++i) {
    for (int j = 0; j < output_width; ++j) {
        for (int k = 0; k < input_channel; ++k) {   /* one filter per input channel */
            float sum = bias[k];
            for (int m = 0; m < kernel_size; ++m) {
                for (int n = 0; n < kernel_size; ++n) {
                    int ypos = i * stride + m - padding;
                    int xpos = j * stride + n - padding;
                    /* skip taps that fall outside the padded input */
                    if (ypos < 0 || ypos >= input_height) continue;
                    if (xpos < 0 || xpos >= input_width) continue;
                    float x = input[ypos * input_width * input_channel + xpos * input_channel + k];
                    float w = weights[(m * kernel_size + n) * input_channel + k];
                    sum += w * x;
                }
            }
            /* ReLU6 activation: clamp to [0, 6] */
            sum = fmax(sum, 0.f);
            sum = fmin(sum, 6.f);
            output[index++] = sum;
        }
    }
}
Training the Model
MobileNet is open source as part of TensorFlow. The pre-trained weights cover the 1001 ImageNet classes, so classifying custom categories requires collecting a custom dataset and retraining.
Data Collection
Open source datasets (ImageNet, MS‑COCO, CIFAR‑10)
User‑reported image URLs
Web crawling with tools like Scrapy or Pyspider
Data Pre‑processing
Divide images among team members for manual labeling
Use annotation tools such as Labelme or labelImg
Model Training
Training uses TensorFlow (slim API) with TensorBoard for monitoring. Common issues and remedies:
Over‑fitting – add regularization, dropout, or increase data diversity.
Under‑fitting – adjust learning rate, increase epochs, reduce regularization.
Slow training – employ GPU acceleration or early‑stop based on accuracy thresholds.
Memory shortage – use chunked training to avoid loading all data at once.
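The chunked-training idea above amounts to streaming the dataset through a fixed-size buffer instead of loading it whole. A minimal sketch in C, assuming a binary file of floats (the file format and the summation stand-in for a training step are illustrative):

```c
#include <stdio.h>
#include <stddef.h>

/* Stream a binary file of floats through a small fixed buffer,
   processing chunk by chunk; memory use is bounded by the buffer
   size regardless of how large the file is. */
double sum_file_in_chunks(FILE *f, size_t chunk_elems) {
    float buf[256];                        /* fixed working buffer */
    if (chunk_elems > 256) chunk_elems = 256;
    double total = 0.0;
    size_t got;
    while ((got = fread(buf, sizeof(float), chunk_elems, f)) > 0)
        for (size_t i = 0; i < got; ++i)
            total += buf[i];               /* stand-in for one training step */
    return total;
}
```

In real training the same pattern applies per batch: read a chunk, run a training step on it, discard it, and read the next.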
Transfer Learning
Replace MobileNet’s final layer with a new fully‑connected layer for the target classes, cache the original output, and train only the new weights.
Terminal Deployment
TensorFlow Lite is chosen for mobile inference due to its small binary size (~1 MB) and lower power consumption.
Model Conversion
bazel run --config=opt \
//tensorflow/contrib/lite/toco:toco -- \
--input_file=/tmp/mobilenet_v1_0.50_128/frozen_graph.pb \
--output_file=/tmp/foo.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=FLOAT \
--input_shape=1,128,128,3 \
--input_array=input \
--output_array=MobilenetV1/Predictions/Reshape_1
For a quantized model (uint8), use:
bazel run --config=opt \
//tensorflow/contrib/lite/toco:toco -- \
--input_file=/tmp/some_quantized_graph.pb \
--output_file=/tmp/foo.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--input_shape=1,128,128,3 \
--input_array=input \
--output_array=MobilenetV1/Predictions/Reshape_1 \
--mean_value=128 \
--std_value=127
Running on Device
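The --mean_value=128 and --std_value=127 flags in the conversion command above determine how uint8 tensor values map back to real numbers at inference time: real = (quantized − mean_value) / std_value, which sends the uint8 range approximately onto [−1, 1]. A sketch of this mapping:

```c
#include <stdint.h>

/* toco's quantization convention: real = (quantized - mean_value) / std_value.
   With mean_value = 128 and std_value = 127, uint8 values map approximately
   onto the range [-1, 1]. */
float dequantize(uint8_t q, float mean_value, float std_value) {
    return ((float)q - mean_value) / std_value;
}
```

For example, dequantize(128, 128.f, 127.f) returns 0 and dequantize(255, 128.f, 127.f) returns 1.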
iOS example (Objective‑C):
NSString* graph_path = FilePathForResourceName(model_file_name, @"tflite");
model = tflite::FlatBufferModel::BuildFromFile([graph_path UTF8String]);
if (!model) LOG(FATAL) << "Failed to mmap model " << graph_path;
// Build interpreter, allocate tensors, etc.
Android example (C++):
int input = interpreter->inputs()[0];
uint8_t* out = interpreter->typed_tensor<uint8_t>(input);
// Fill input tensor, invoke interpreter, retrieve top-N results.
GPU Accelerated Model
To overcome CPU‑bound inference latency and heat, GPU acceleration is applied.
GPU Programming Choices
iOS – Metal (leverages GPU fully)
Android – OpenGL ES 3.1 with Compute Shaders
Weight Extraction
Weights are extracted from the .tflite file (NWHC layout) and optionally reordered to NCWH for GPU kernels.
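The reordering can be sketched as an index permutation over the flat weight buffer. A sketch assuming a single kernel set (N = 1), so the layouts reduce to W-H-C versus C-W-H; the dimension names follow the article's NWHC/NCWH convention:

```c
/* Reorder a flat weight buffer from N-W-H-C order to N-C-W-H order
   (N assumed to be 1 here, as for a single convolution's kernel set). */
void reorder_nwhc_to_ncwh(const float *src, float *dst, int w, int h, int c) {
    for (int iw = 0; iw < w; ++iw)
        for (int ih = 0; ih < h; ++ih)
            for (int ic = 0; ic < c; ++ic)
                /* src index: (iw * h + ih) * c + ic   (channel fastest)
                   dst index: (ic * w + iw) * h + ih   (height fastest) */
                dst[(ic * w + iw) * h + ih] = src[(iw * h + ih) * c + ic];
}
```

GPU kernels often prefer the channel-major layout because each shader invocation then reads one channel's weights from a contiguous run of memory.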
Shader Implementations
Metal kernel for depthwise convolution (quantized):
kernel void depthwiseConv_quantized(
        texture2d_array<half, access::read> inTexture [[texture(0)]],
        texture2d_array<half, access::write> outTexture [[texture(1)]],
        constant KernelParams& params [[buffer(0)]],
        const device int* weights [[buffer(1)]],
        const device int4* biasTerms [[buffer(2)]],
        ushort3 gid [[thread_position_in_grid]]) {
    // Compute depthwise convolution with quantization handling.
}
OpenGL ES compute shader (C-style):
void main() {
    int idx = int(gl_GlobalInvocationID.x);
    int idy = int(gl_GlobalInvocationID.y);
    int idz = int(gl_GlobalInvocationID.z);
    // Perform depthwise convolution and quantization.
}
GPU Acceleration Results
On an iPhone X, GPU inference is roughly 4× faster than CPU; on a Huawei P9, roughly 3× faster. Power consumption drops by about 20%, and the devices run noticeably cooler.
Practical Outcomes
The project demonstrates end‑to‑end engineering of a deep neural network for mobile AI: model selection, data handling, training, transfer learning, lightweight deployment with TensorFlow Lite, and substantial performance gains via GPU acceleration.
References
https://blog.csdn.net/u011974639/article/details/79199306
https://arxiv.org/pdf/1704.04861.pdf
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
https://www.tensorflow.org/mobile/tflite/
https://arm-software.github.io/opengl-es-sdk-for-android/compute_intro.html
https://mooc.study.163.com/university/deeplearning_a
https://github.com/hollance/Forge
https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf