How to Engineer MobileNet for Efficient Image Classification on Mobile Devices

This article details the engineering of MobileNet V1 for image classification on mobile terminals, covering its depthwise separable convolution architecture, data collection and preprocessing, model training with transfer learning, TensorFlow Lite conversion, deployment on iOS/Android, and GPU acceleration techniques for faster inference.


MobileNet Model

MobileNet is a lightweight deep neural network designed by Google for mobile devices, offering a low parameter count and computational cost while achieving strong performance on image classification and object detection tasks. This article uses MobileNet V1 for image classification.

MobileNet employs depthwise separable convolutions, splitting a standard convolution into a depthwise convolution and a pointwise (1x1) convolution, dramatically reducing parameters and computation.

With a DK×DK kernel, M input channels, and N output channels, a depthwise separable convolution has DK×DK×M + M×N parameters, far fewer than the DK×DK×M×N of a standard convolution, while ImageNet accuracy drops by only about 1%.
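
As a quick sanity check of that formula, here is a minimal sketch (with illustrative layer sizes, not taken from the paper) that compares the two counts for a single layer:

#include <stdio.h>

/* Worked example: parameter counts for one layer.
 * DK = kernel size, M = input channels, N = output channels.
 * The values below are illustrative, not from the MobileNet paper. */
int main(void) {
    int DK = 3, M = 32, N = 64;

    int standard  = DK * DK * M * N;        /* standard convolution      */
    int depthwise = DK * DK * M;            /* depthwise stage           */
    int pointwise = M * N;                  /* pointwise (1x1) stage     */
    int separable = depthwise + pointwise;  /* depthwise separable total */

    printf("standard:  %d parameters\n", standard);   /* 18432 */
    printf("separable: %d parameters\n", separable);  /* 2336  */
    printf("ratio:     %.2fx fewer\n", (double)standard / separable);
    return 0;
}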

Depthwise Convolution Example (C)

#include <math.h>
#include <stdlib.h>

/* Depthwise convolution with ReLU6: each input channel is convolved with its
 * own kernel_size x kernel_size filter. Data is stored interleaved (HWC). */
float *depthwise_conv(const float *input, const float *weights, const float *bias,
                      int input_width, int input_height, int input_channel,
                      int output_width, int output_height,
                      int kernel_size, int stride, int padding) {
    int output_count = output_width * output_height * input_channel;
    float *output = (float *)malloc(output_count * sizeof(float));
    int index = 0;
    for (int i = 0; i < output_height; ++i) {
        for (int j = 0; j < output_width; ++j) {
            for (int k = 0; k < input_channel; ++k) {
                float sum = bias[k];
                for (int m = 0; m < kernel_size; ++m) {
                    for (int n = 0; n < kernel_size; ++n) {
                        int ypos = i * stride + m - padding;
                        int xpos = j * stride + n - padding;
                        /* Skip taps that fall in the zero-padding region. */
                        if (ypos < 0 || ypos >= input_height) continue;
                        if (xpos < 0 || xpos >= input_width) continue;
                        float x = input[ypos * input_width * input_channel + xpos * input_channel + k];
                        float w = weights[(m * kernel_size + n) * input_channel + k];
                        sum += w * x;
                    }
                }
                /* ReLU6 activation, as used throughout MobileNet. */
                sum = fminf(fmaxf(sum, 0.f), 6.f);
                output[index++] = sum;
            }
        }
    }
    return output;
}
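
The listing above covers only the depthwise stage. A minimal sketch of the complementary pointwise (1x1) stage, assuming the same interleaved (HWC) layout, might look like this; the function name and weight layout are illustrative:

#include <math.h>
#include <stdlib.h>

/* Pointwise (1x1) convolution: mixes the in_channels values at each spatial
 * position into out_channels values, completing the depthwise separable pair. */
float *pointwise_conv(const float *input, const float *weights, const float *bias,
                      int width, int height, int in_channels, int out_channels) {
    float *output = (float *)malloc((size_t)width * height * out_channels * sizeof(float));
    int index = 0;
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            const float *pixel = input + (i * width + j) * in_channels;
            for (int n = 0; n < out_channels; ++n) {
                float sum = bias[n];
                for (int m = 0; m < in_channels; ++m) {
                    /* Weights assumed laid out as [in_channels][out_channels]. */
                    sum += weights[m * out_channels + n] * pixel[m];
                }
                /* ReLU6, matching the depthwise stage. */
                sum = fminf(fmaxf(sum, 0.f), 6.f);
                output[index++] = sum;
            }
        }
    }
    return output;
}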

Training the Model

MobileNet's TensorFlow implementation and pre-trained weights are open source, but the released checkpoints cover the 1001 ImageNet classes; recognizing custom categories therefore requires building a custom dataset.

Data Collection

Open source datasets (ImageNet, MS‑COCO, CIFAR‑10)

User‑reported image URLs

Web crawling with tools like Scrapy or Pyspider

Data Pre‑processing

Divide images among team members for manual labeling

Use annotation tools such as Labelme or labelImg

Model Training

Training uses TensorFlow (slim API) with TensorBoard for monitoring. Common issues and remedies:

Over‑fitting – add regularization, dropout, or increase data diversity.

Under‑fitting – adjust learning rate, increase epochs, reduce regularization.

Slow training – employ GPU acceleration or early‑stop based on accuracy thresholds.

Memory shortage – use chunked training to avoid loading all data at once.

Transfer Learning

Replace MobileNet's final classification layer with a new fully-connected layer sized for the target classes, cache the outputs of the frozen layers (the bottleneck features) so they are computed only once, and train only the new layer's weights.

Terminal Deployment

TensorFlow Lite is chosen for mobile inference due to its small binary size (~1 MB) and lower power consumption.

Model Conversion

bazel run --config=opt \
    //tensorflow/contrib/lite/toco:toco -- \
    --input_file=/tmp/mobilenet_v1_0.50_128/frozen_graph.pb \
    --output_file=/tmp/foo.tflite \
    --input_format=TENSORFLOW_GRAPHDEF \
    --output_format=TFLITE \
    --inference_type=FLOAT \
    --input_shape=1,128,128,3 \
    --input_array=input \
    --output_array=MobilenetV1/Predictions/Reshape_1

For a quantized model (uint8), use:

bazel run --config=opt \
    //tensorflow/contrib/lite/toco:toco -- \
    --input_file=/tmp/some_quantized_graph.pb \
    --output_file=/tmp/foo.tflite \
    --input_format=TENSORFLOW_GRAPHDEF \
    --output_format=TFLITE \
    --inference_type=QUANTIZED_UINT8 \
    --input_shape=1,128,128,3 \
    --input_array=input \
    --output_array=MobilenetV1/Predictions/Reshape_1 \
    --mean_value=128 \
    --std_value=127
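
The mean_value and std_value flags describe how the uint8 input relates to the original float input. As I understand toco's convention, real_value = (quantized_value - mean_value) / std_value, so 128 and 127 map the byte range roughly onto [-1, 1]. A small sketch of that mapping (the helper name is illustrative):

#include <stdio.h>
#include <stdint.h>

/* Dequantization under toco's mean_value/std_value convention (assumed):
 * real = (quantized - mean_value) / std_value. */
static float dequantize(uint8_t q, float mean_value, float std_value) {
    return ((float)q - mean_value) / std_value;
}

int main(void) {
    /* With --mean_value=128 --std_value=127, uint8 inputs span roughly [-1, 1]. */
    printf("%f\n", dequantize(0, 128.f, 127.f));    /* about -1.008 */
    printf("%f\n", dequantize(128, 128.f, 127.f));  /* 0.0 */
    printf("%f\n", dequantize(255, 128.f, 127.f));  /* 1.0 */
    return 0;
}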

Running on Device

iOS example (Objective-C++):

NSString* graph_path = FilePathForResourceName(model_file_name, @"tflite");
model = tflite::FlatBufferModel::BuildFromFile([graph_path UTF8String]);
if (!model) LOG(FATAL) << "Failed to mmap model " << graph_path;
// Build the interpreter (a std::unique_ptr<tflite::Interpreter>) and allocate tensors.
tflite::ops::builtin::BuiltinOpResolver resolver;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->AllocateTensors();

Android example (C++):

int input = interpreter->inputs()[0];
uint8_t* input_data = interpreter->typed_tensor<uint8_t>(input);
// Copy the preprocessed image bytes into input_data, then run inference.
interpreter->Invoke();
uint8_t* output_data = interpreter->typed_tensor<uint8_t>(interpreter->outputs()[0]);
// Scan output_data for the top-N scores and map them to the class labels.

GPU Accelerated Model

To reduce the latency and device heating of CPU-bound inference, the model is also run on the GPU.

GPU Programming Choices

iOS – Metal, which gives direct, low-level access to the GPU

Android – OpenGL ES 3.1 with Compute Shaders

Weight Extraction

Weights are extracted from the .tflite file (NWHC layout) and optionally reordered to NCWH for GPU kernels.
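
A minimal sketch of that reordering, treating the weights as a flat float array; the dimension names and exact source layout are assumptions to verify against the converter version in use:

/* Reorder a 4-D weight tensor from N x W x H x C to N x C x W x H.
 * Both buffers are dense float arrays of size n*w*h*c. */
void reorder_nwhc_to_ncwh(const float *src, float *dst,
                          int n, int w, int h, int c) {
    for (int in = 0; in < n; ++in)
        for (int iw = 0; iw < w; ++iw)
            for (int ih = 0; ih < h; ++ih)
                for (int ic = 0; ic < c; ++ic) {
                    int src_idx = ((in * w + iw) * h + ih) * c + ic;
                    int dst_idx = ((in * c + ic) * w + iw) * h + ih;
                    dst[dst_idx] = src[src_idx];
                }
}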

Shader Implementations

Metal kernel for depthwise convolution (quantized):

kernel void depthwiseConv_quantized(
    texture2d_array<half, access::read> inTexture [[texture(0)]],
    texture2d_array<half, access::write> outTexture [[texture(1)]],
    constant KernelParams& params [[buffer(0)]],
    const device int* weights [[buffer(1)]],
    const device int4* biasTerms [[buffer(2)]],
    ushort3 gid [[thread_position_in_grid]]) {
    // Compute depthwise convolution with quantization handling.
}

OpenGL ES compute shader (GLSL):

void main(){
    int idx = int(gl_GlobalInvocationID.x);
    int idy = int(gl_GlobalInvocationID.y);
    int idz = int(gl_GlobalInvocationID.z);
    // Perform depthwise convolution and quantization.
}
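
For context, dispatching such a compute shader from the C side of an OpenGL ES 3.1 application might look roughly like the sketch below; the buffer binding points, the assumption of a local work-group size of 1, and the helper name are illustrative:

#include <GLES3/gl31.h>

/* Dispatch the depthwise-convolution compute shader; 'program' is the
 * compiled and linked compute program, buffer bindings are illustrative. */
void run_depthwise_conv(GLuint program, GLuint input_buf, GLuint weight_buf,
                        GLuint output_buf, int out_w, int out_h, int out_c) {
    glUseProgram(program);

    /* Bind input, weights, and output as shader storage buffers. */
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, input_buf);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, weight_buf);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, output_buf);

    /* One invocation per output element, assuming local_size = 1. */
    glDispatchCompute((GLuint)out_w, (GLuint)out_h, (GLuint)out_c);

    /* Make the results visible before reading them back. */
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}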

GPU Acceleration Results

On iPhone X, GPU inference is roughly 4× faster than the CPU path; on Huawei P9, roughly 3× faster. Power consumption drops by about 20%, and the device runs noticeably faster and cooler.

Practical Outcomes

The project demonstrates end‑to‑end engineering of a deep neural network for mobile AI: model selection, data handling, training, transfer learning, lightweight deployment with TensorFlow Lite, and substantial performance gains via GPU acceleration.

References

https://blog.csdn.net/u011974639/article/details/79199306

https://arxiv.org/pdf/1704.04861.pdf

https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

https://www.tensorflow.org/mobile/tflite/

https://arm-software.github.io/opengl-es-sdk-for-android/compute_intro.html

https://mooc.study.163.com/university/deeplearning_a

https://github.com/hollance/Forge

https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
