Deploying Multiple CNN Models on Low‑End Devices with MNN: Memory Tricks and Error Debugging

This article explains how a high‑traffic map service captures road features using client‑side computer‑vision models, details the deployment of many CNNs with the lightweight MNN engine on memory‑constrained devices, and shares practical memory‑saving techniques, inference scheduling, and error‑analysis methods.

Amap Tech

Jun 4, 2021

Background

The navigation platform needs to run more than ten CNN models on low‑performance client devices to extract road‑level features (traffic cameras, road conditions, signage) in real time. The lightweight MNN inference engine is used to deploy these models on‑device.

Memory Consumption

During inference, memory is consumed by four main components:

ModelBuffer: the deserialized model data, roughly the size of the original model file.

FeatureMaps: intermediate tensors produced by each layer.

ModelParams: static parameters such as weights, biases, and operator definitions (weights dominate this portion).

Heap/Stack: runtime heap and stack memory.

Memory Optimization

To keep peak memory low, the following practices are applied:

After createFromFile and createSession, release the model buffer with releaseModel so that model buffers do not accumulate as more models are loaded.

Reuse the same memory region for the input image and the inputTensor.

Reuse the memory region for the output tensor and the post‑processing result data.
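The two reuse practices amount to zero‑copy buffer sharing. As an illustration only, the sketch below uses plain Python rather than MNN code, and the buffer size and the names shared, image_view, and tensor_view are hypothetical. It shows one preallocated float buffer serving as both the preprocessing target and the model input:

```python
from array import array

# Hypothetical input size: a 4x4 RGB image stored as float values.
NUM_VALUES = 4 * 4 * 3

# Allocate the buffer once; both views below refer to this same memory.
shared = array("f", [0.0] * NUM_VALUES)
image_view = memoryview(shared)   # preprocessing writes pixels here
tensor_view = memoryview(shared)  # inference reads its input from here

# A pixel written through one view is visible through the other: no copy is made.
image_view[0] = 0.5
assert tensor_view[0] == 0.5

# Memory held for "image + input tensor" is one buffer instead of two.
bytes_with_sharing = shared.itemsize * NUM_VALUES
bytes_without_sharing = 2 * bytes_with_sharing
print(bytes_with_sharing, bytes_without_sharing)
```

The same idea applies on the output side: post‑processing can read results directly from the output buffer instead of copying them first.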

For a 2.4 MB visual model on Android the memory profile is:

0 MB before loading.

~5.24 MB after deserialization and feature‑map allocation.

~3.09 MB after releasing the model buffer.

~4.25 MB after the input tensor is prepared, with its memory shared with the input image.

~5.76 MB after running the session (additional stack/heap usage).

Memory returns to the initial level after the model is released.
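The differences between consecutive stages show where the memory goes. The short script below only does arithmetic on the figures reported above. The projected peak without releaseModel is an estimate that assumes the later stages would cost the same if the model buffer were kept.

```python
# Resident memory (MB) at each stage, as reported above for the 2.4 MB model.
profile_mb = [
    ("before loading", 0.00),
    ("after deserialization and feature-map allocation", 5.24),
    ("after releasing the model buffer", 3.09),
    ("after the input-tensor step", 4.25),
    ("after running the session", 5.76),
]

# Change in memory between consecutive stages.
deltas = {}
for (_, prev_mb), (stage, curr_mb) in zip(profile_mb, profile_mb[1:]):
    deltas[stage] = round(curr_mb - prev_mb, 2)

for stage, delta in deltas.items():
    print(f"{stage}: {delta:+.2f} MB")

observed_peak = max(mb for _, mb in profile_mb)

# Assumption: keeping the model buffer would raise every later stage
# by exactly the amount that releaseModel freed.
freed = -deltas["after releasing the model buffer"]
peak_without_release = round(observed_peak + freed, 2)
print(f"peak: {observed_peak} MB, without release: {peak_without_release} MB")
```

Releasing the buffer frees about 2.15 MB, which is close to the 2.4 MB model file and consistent with the description of ModelBuffer. Under the stated assumption, skipping the release would push the peak from 5.76 MB to roughly 7.91 MB for this one model.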

Assuming the model buffer has already been released, the peak memory for a single model can be approximated by the formula:

MemoryPeak = StaticMemory + DynamicMemory + MemoryHS

where:

StaticMemory = memory occupied by weights, biases and operators.

DynamicMemory = memory for feature‑maps generated during inference.

MemoryHS = estimated stack/heap overhead (typically 0.5–2 MB).
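As a rough worked example of the formula, the sketch below brackets the peak using the 0.5–2 MB stack/heap range. The parameter and feature‑map counts are hypothetical, and float32 storage (4 bytes per value) is an assumption rather than something stated in the article.

```python
def peak_memory_mb(static_mb, dynamic_mb, hs_mb):
    """MemoryPeak = StaticMemory + DynamicMemory + MemoryHS (all in MB)."""
    return static_mb + dynamic_mb + hs_mb


def float32_mb(num_values):
    """Size in MB of num_values 32-bit floats."""
    return num_values * 4 / (1024 * 1024)


# Hypothetical model: 500k float32 parameters and 300k float32
# feature-map values resident at the same time.
static_mb = float32_mb(500_000)
dynamic_mb = float32_mb(300_000)

# Bracket the estimate with the 0.5-2 MB stack/heap range quoted above.
low = peak_memory_mb(static_mb, dynamic_mb, 0.5)
high = peak_memory_mb(static_mb, dynamic_mb, 2.0)
print(f"estimated peak: {low:.2f}-{high:.2f} MB")
```

Because the stack/heap term is only an estimate, reporting a range rather than a single number keeps the approximation honest when budgeting memory for several models.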

Inference Scheduling

MNN allows flexible scheduling of model paths and back‑ends. Different branches of a network can be assigned to different execution units (e.g., detection branch on CPU, segmentation branch on OpenGL) to improve parallelism and reduce overall latency.
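The latency benefit can be reasoned about with a simple model. The sketch below is not MNN code. It assumes that branches on the same back‑end run one after another while different back‑ends run fully in parallel, and all timings are hypothetical.

```python
from collections import defaultdict


def estimated_latency_ms(branches):
    """branches: list of (name, backend, latency_ms).

    Assumes branches sharing a backend run serially and that
    different backends run fully in parallel.
    """
    per_backend = defaultdict(float)
    for _name, backend, latency_ms in branches:
        per_backend[backend] += latency_ms
    return max(per_backend.values())


# Hypothetical timings, for illustration only.
all_on_cpu = [("detection", "CPU", 30.0), ("segmentation", "CPU", 40.0)]
split = [("detection", "CPU", 30.0), ("segmentation", "OpenGL", 40.0)]

print(estimated_latency_ms(all_on_cpu))  # 70.0
print(estimated_latency_ms(split))       # 40.0
```

In practice a branch rarely takes the same time on two different back‑ends, and data transfer between them adds cost, so measured timings on the target device should replace these placeholder numbers.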

Inference Process

The inference workflow consists of:

Deserializing the model file.

Creating a Session with the desired scheduling configuration.

Executing operators layer‑by‑layer. If an operator is not supported on the selected back‑end, MNN automatically falls back to the backup back‑end configured for the session (the CPU by default).
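The fallback rule in the last step can be sketched as a per‑operator lookup. This is a simplified model of the behaviour, not MNN's implementation, and the operator coverage table is hypothetical rather than MNN's real support matrix.

```python
def assign_backends(operators, primary, backup, supported):
    """Return [(operator, backend)], preferring primary and falling back to backup.

    supported maps a backend name to the set of operator types it implements.
    """
    plan = []
    for op in operators:
        if op in supported[primary]:
            plan.append((op, primary))
        elif op in supported[backup]:
            plan.append((op, backup))
        else:
            raise ValueError(f"operator {op!r} is not supported on any backend")
    return plan


# Hypothetical operator coverage, not MNN's real support matrix.
supported = {
    "OpenGL": {"Convolution", "ReLU", "Pooling"},
    "CPU": {"Convolution", "ReLU", "Pooling", "Softmax", "DetectionOutput"},
}
model_ops = ["Convolution", "ReLU", "Pooling", "DetectionOutput"]

plan = assign_backends(model_ops, "OpenGL", "CPU", supported)
for op, backend in plan:
    print(f"{op}: {backend}")
```

Printing such a plan before deployment makes it easy to spot operators that silently land on the backup back‑end, which matters because each switch between back‑ends can add copy overhead.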

Deployment Timing

Timing measurements on target devices show that model deserialization and session creation dominate the cost. These steps should therefore be performed once, with the resulting session reused for every subsequent image.
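A minimal sketch of this load‑once pattern follows. The loader and inference functions are hypothetical stand‑ins for the real engine calls, and ModelRunner is an illustrative name rather than part of MNN.

```python
class ModelRunner:
    """Loads a model once and reuses the prepared session for every image."""

    def __init__(self, load_fn, infer_fn):
        self._load_fn = load_fn      # expensive: deserialize + create session
        self._infer_fn = infer_fn    # cheap by comparison: run one image
        self._session = None
        self.load_count = 0

    def run(self, image):
        if self._session is None:    # pay the setup cost only on first use
            self._session = self._load_fn()
            self.load_count += 1
        return self._infer_fn(self._session, image)


# Stand-ins for the real engine calls (hypothetical).
def fake_load():
    return {"scale": 2}


def fake_infer(session, image):
    return [session["scale"] * px for px in image]


runner = ModelRunner(fake_load, fake_infer)
results = [runner.run(img) for img in ([1, 2], [3, 4], [5, 6])]
print(runner.load_count, results)
```

Three images are processed here with a single load. The trade‑off is that the session's memory stays resident between frames, so this has to be weighed against the memory budget when more than ten models are deployed.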

Model Error Analysis

When the on‑device inference results differ from those obtained in an x86 training environment, the following procedure helps locate the cause:

Fix the input tensor on both platforms and compare the raw outputs to confirm a discrepancy.

Rule out input‑related errors first: small differences in how the input is converted to floating point can be amplified as they propagate through the layers.

Use runSessionWithCallBack to capture intermediate results of each operator and identify the first operator that produces a divergent value.

Once the problematic operator is identified, inspect the back‑end implementation to find the exact code line responsible for the deviation.
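Once per‑operator outputs have been captured on both platforms, finding the first divergent operator is a straightforward comparison. The sketch below assumes the traces have already been exported as lists of (operator name, values) in execution order. The trace data, the tolerances, and the function name are all hypothetical.

```python
import math


def first_divergent_operator(reference, device, rel_tol=1e-3, abs_tol=1e-5):
    """Return (operator name, index of first mismatching value), or None.

    reference and device are lists of (operator_name, values) in execution order.
    """
    for (ref_name, ref_vals), (dev_name, dev_vals) in zip(reference, device):
        if ref_name != dev_name or len(ref_vals) != len(dev_vals):
            raise ValueError(f"traces do not line up at {ref_name!r} / {dev_name!r}")
        for i, (r, d) in enumerate(zip(ref_vals, dev_vals)):
            if not math.isclose(r, d, rel_tol=rel_tol, abs_tol=abs_tol):
                return ref_name, i
    return None


# Hypothetical traces: the device output drifts starting at relu1.
x86_trace = [("conv1", [0.10, 0.20]), ("relu1", [0.10, 0.20]), ("pool1", [0.20])]
arm_trace = [("conv1", [0.10, 0.20]), ("relu1", [0.10, 0.25]), ("pool1", [0.25])]

print(first_divergent_operator(x86_trace, arm_trace))  # ('relu1', 1)
```

Stopping at the first mismatch matters: every operator after it inherits the error, so later differences say nothing about where the problem originates. The tolerances should be chosen to match the precision of the back‑end under test.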

A practical technique is to fill the model input with 0.46875, a value that is exactly representable in binary floating point (0.46875 = 15/32) and is therefore stored without rounding on both x86 and ARM CPUs. If the outputs still differ, the error is likely caused by an operator implementation rather than by how the input was rounded.
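The exactness of 0.46875 can be checked directly. The snippet below uses only the Python standard library and tests whether a value survives being stored in narrower floating‑point formats. The helper name survives_roundtrip is illustrative.

```python
import struct
from fractions import Fraction


def survives_roundtrip(value, fmt):
    """True if value is unchanged after being stored in the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0] == value


x = 0.46875
assert Fraction(x) == Fraction(15, 32)   # exactly 15/32, no rounding
assert survives_roundtrip(x, "<f")       # exact as a 32-bit float
assert survives_roundtrip(x, "<e")       # exact as a 16-bit float
print(struct.pack(">f", x).hex())        # 3ef00000

# For contrast, 0.1 has no finite binary expansion and is rounded when narrowed.
assert not survives_roundtrip(0.1, "<f")
```

Because the value is also exact in 16‑bit floats, it remains a clean test input even when a back‑end computes in reduced precision. Any difference in the outputs then comes from the computation itself.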

Conclusion

The MNN engine is an effective solution for edge inference on low‑memory devices. Successful deployment requires careful management of memory consumption, scheduling of model paths to appropriate back‑ends, and systematic debugging of operator‑level discrepancies.

Future Work

As device capabilities improve, richer back‑ends such as OpenCL and OpenGL will be leveraged to further accelerate inference. More diverse road‑feature models will be deployed on clients using the MNN framework.
