How Fast Can Your Smartphone Run ML Models? Exploring Edge AI Optimization

This article examines the computational capabilities of modern mobile devices for machine learning, compares training times on a MacBook and iPhone, explains model evaluation metrics like FLOPs, and provides step‑by‑step guides for converting and optimizing models using TensorFlow, PyTorch, ONNX, JAX, and TVM for edge deployment.


Before introducing edge AI engineering practices, the article asks a fundamental question: how powerful is the compute capability on the device? An experiment with the MNIST handwritten digit recognition project shows that training 60,000 samples for 10 epochs takes 128 seconds on a 2015 15‑inch MacBook Pro i7 CPU but only 86 seconds on an iPhone 13 Pro Max, demonstrating that mobile devices can meet the compute demands of model inference and even training.

On less powerful phones, it is still necessary to measure a model's compute requirements and optimize it for the target framework and platform to ensure a good user experience. Model optimization consists of two parts: (1) model compression through pruning, quantization, knowledge distillation, and similar techniques, and (2) converting the model into a platform-specific optimized version with framework-provided tools or third-party tools such as JAX and TVM.
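
As a concrete illustration of the first part, the sketch below applies post-training dynamic-range quantization during TFLite conversion. This is a minimal example added for illustration, not code from the original article, and it assumes a trained Keras model is already in memory as model.

# Post-training dynamic-range quantization (illustrative sketch; model is an assumed trained Keras model)
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights to 8-bit integers
quantized_tflite_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(quantized_tflite_model)

Dynamic-range quantization typically shrinks a model to roughly a quarter of its float32 size; full integer quantization goes further but requires a representative dataset for calibration.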

Evaluating and Preparing Models

The iPhone 13 Pro Max uses Apple's A15 processor, built on a 5 nm process with about 15 billion transistors and an NPU capable of 15.8 TOPS. Model complexity is measured in FLOPs (floating-point operations). Figure 4-5 (not shown) compares the parameters, model size, and FLOPs of common networks; larger parameter counts generally imply higher FLOPs, though exceptions such as AlexNet and ResNet-152 exist because of architectural differences.
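
To make the FLOPs metric concrete, the back-of-the-envelope sketch below estimates the operations in a single convolution layer. The layer dimensions are illustrative assumptions, not figures from the article.

# Rough FLOPs estimate for one Conv2D layer (illustrative assumptions)
# FLOPs ≈ 2 * K_h * K_w * C_in * C_out * H_out * W_out  (one multiply and one add per weight application)
k_h, k_w = 3, 3          # kernel size (assumed)
c_in, c_out = 64, 128    # input / output channels (assumed)
h_out, w_out = 56, 56    # output feature-map size (assumed)

flops = 2 * k_h * k_w * c_in * c_out * h_out * w_out
print('~{:.2f} GFLOPs for this single layer'.format(flops / 1e9))  # about 0.46 GFLOPs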

TensorFlow can compute the exact FLOPs of a model:

# TensorFlow recommended FLOPs calculation
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2_as_graph

def get_flops(model):
    # Wrap the Keras model in a concrete function with a fixed batch size of 1
    concrete = tf.function(lambda inputs: model(inputs))
    concrete_func = concrete.get_concrete_function([
        tf.TensorSpec([1, *inputs.shape[1:]]) for inputs in model.inputs])
    # Freeze variables into constants so the profiler sees a static graph
    frozen_func, graph_def = convert_variables_to_constants_v2_as_graph(concrete_func)
    with tf.Graph().as_default() as graph:
        tf.graph_util.import_graph_def(graph_def, name='')
        run_meta = tf.compat.v1.RunMetadata()
        opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
        flops = tf.compat.v1.profiler.profile(graph=graph, run_meta=run_meta, cmd="op", options=opts)
        return flops.total_float_ops

print("The FLOPs is:{}".format(get_flops(model)), flush=True)

For PyTorch, the thop library can be used:

# PyTorch OpCounter example (pip install thop); model is assumed to be any torch.nn.Module
import torch
from thop import profile

input = torch.randn(1, 1, 28, 28)  # dummy MNIST-shaped input
macs, params = profile(model, inputs=(input, ))
print('Total MACs: {}, Total params: {}'.format(macs, params))
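
Note that thop reports multiply-accumulate operations (MACs) rather than FLOPs. A common convention, assumed here, is that one MAC counts as two floating-point operations for convolution and linear layers, so a rough conversion looks like this:

# Rough MACs-to-FLOPs conversion (common convention, assumed)
flops = 2 * macs
print('Approximate FLOPs: {}'.format(flops))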

Apple’s A15 NPU provides 15.8 TOPS, while Qualcomm Snapdragon 855’s combined CPU+GPU+DSP AI capability is only about 7 TOPS, highlighting the need for model adaptation to Android or iOS hardware accelerators.

Model Conversion Techniques

Conversion can be done via framework‑provided APIs (simple but runtime‑dependent) or third‑party tools (more flexible but may have input/output constraints). The article demonstrates TensorFlow’s conversion to TensorFlow Lite:

import tensorflow as tf

# Define a simple Keras model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1]),
    tf.keras.layers.Dense(units=16, activation='relu'),
    tf.keras.layers.Dense(units=1)
])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(x=[-1, 0, 1], y=[-3, -1, 1], epochs=5)

# Save the model as a SavedModel (optional for the conversion path below)
tf.saved_model.save(model, "saved_model_keras_dir")

# Convert to TFLite (tf.lite.TFLiteConverter.from_saved_model("saved_model_keras_dir") also works)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Write the .tflite file
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
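
To confirm that the converted model behaves like the original Keras model, it can be executed with the TFLite Python interpreter. This verification step is an addition for illustration, not part of the original walkthrough.

# Run the converted model with the TFLite interpreter (illustrative check)
import numpy as np

interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one value, matching the model's [1] input shape
interpreter.set_tensor(input_details[0]['index'], np.array([[2.0]], dtype=np.float32))
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]['index']))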

Command‑line conversion of a SavedModel or Keras H5 file:

# Convert a SavedModel
tflite_convert \
  --saved_model_dir=/tmp/mobilenet_saved_model \
  --output_file=/tmp/mobilenet.tflite

# Convert a Keras H5 model
tflite_convert \
  --keras_model_file=/tmp/mobilenet_keras_model.h5 \
  --output_file=/tmp/mobilenet.tflite

After conversion, the TFLite model’s FLOPs dropped from 2.8 MFLOPs to 1.8 MFLOPs. The open‑source tflite-flops tool can verify this:

# Install and use tflite-flops
pip3 install git+https://github.com/lisosia/tflite-flops
python -m tflite_flops model.tflite
# Sample output (truncated)
OP_NAME | M FLOPS
-------------------
CONV_2D | 0.4
... (other ops omitted) ...
Total: 1.6 M FLOPS

ONNX serves as an open neural‑network exchange format, enabling models to run on various runtimes. Converting a TensorFlow model to ONNX:

# Convert TensorFlow SavedModel to ONNX
pip install -U tf2onnx
python -m tf2onnx.convert \
    --saved-model ./output/saved_model \
    --output ./output/mnist1.onnx \
    --opset 7
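
A converted model can be checked with ONNX Runtime (the onnxruntime package) before deployment. The sketch below is an illustration; the input shape is an assumption for an MNIST-style model and should be read from the actual graph.

# Verify the exported ONNX model with ONNX Runtime (illustrative sketch)
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('./output/mnist1.onnx')
input_name = session.get_inputs()[0].name              # read the real input name from the graph
dummy = np.random.rand(1, 28, 28).astype(np.float32)   # assumed MNIST-style input shape
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)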

Core ML models can also be bridged to ONNX with coremltools and onnxmltools, for example when an existing iOS model needs to run on other runtimes:

# Convert an existing Core ML model to ONNX
import coremltools
import onnxmltools

input_coreml_model = 'model.mlmodel'
output_onnx_model = 'model.onnx'

# Load the Core ML specification, convert it, and save the ONNX model
coreml_model = coremltools.utils.load_spec(input_coreml_model)
onnx_model = onnxmltools.convert_coreml(coreml_model)
onnxmltools.utils.save_model(onnx_model, output_onnx_model)

Compilation‑Based Optimization

Compiling models with frameworks like XLA (Google), Glow (Meta), JAX, and TVM can yield significant performance gains. XLA separates high‑level optimizer (HLO) IR from backend code generation, leveraging LLVM for multi‑target optimization. JAX offers just‑in‑time compilation, automatic parallelization, vectorization, and differentiation.
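
A tiny example of those JAX features, written as a generic illustration rather than code from the article:

# Minimal JAX sketch: JIT compilation, autodiff, and vectorization
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

grad_loss = jax.jit(jax.grad(loss))               # compile the gradient function with XLA
batched_loss = jax.vmap(loss, in_axes=(None, 0))  # vectorize over a batch of inputs

w = jnp.ones((3,))
x = jnp.arange(6.0).reshape(2, 3)
print(grad_loss(w, x[0]))    # gradient for a single sample
print(batched_loss(w, x))    # per-sample losses for the batch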

Example of preparing a model with JAX and converting it to TFLite:

# Install JAX dependencies
pip install tf-nightly --upgrade
pip install jax --upgrade
pip install jaxlib --upgrade
# JAX model definition and conversion
import numpy as np
import tensorflow as tf
import functools
import jax.numpy as jnp
from jax import jit, grad, random
from jax.experimental import stax  # in newer JAX versions: from jax.example_libraries import stax

# One‑hot helper omitted for brevity
# Load MNIST data (same as TensorFlow example)
# Define model
init_random_params, predict = stax.serial(
    stax.Flatten,
    stax.Dense(1024), stax.Relu,
    stax.Dense(1024), stax.Relu,
    stax.Dense(10), stax.LogSoftmax)

# Initialize parameters (the training loop from the original example is omitted here)
_, params = init_random_params(random.PRNGKey(0), (-1, 28, 28))

# Convert to TFLite
serving_func = functools.partial(predict, params)
x_input = jnp.zeros((1, 28, 28))
converter = tf.lite.TFLiteConverter.experimental_from_jax(
    [serving_func], [[('input1', x_input)]])
tflite_model = converter.convert()
with open('jax_mnist.tflite', 'wb') as f:
    f.write(tflite_model)

TVM, an Apache‑backed machine‑learning compiler, focuses on runtime‑aware optimizations and supports a wide range of targets (CPU, GPU, NPU, FPGA, WebGPU). The article walks through downloading a ResNet‑50 ONNX model, building TVM from source, and compiling the model:

# Download ONNX model
wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx

# Build TVM (macOS example)
brew install gcc git cmake llvm python
git clone --recursive https://github.com/apache/tvm tvm
cd tvm && mkdir build && cp cmake/config.cmake build
cd build && cmake .. && make -j4

# Install Python package
pip3 install --user numpy decorator attrs tornado psutil xgboost cloudpickle
pip3 install --user onnx onnxoptimizer libomp pillow
export MACOSX_DEPLOYMENT_TARGET=10.9
cd python && python setup.py install --user && cd ..

Compile the model with TVM:

# First‑stage compilation
python -m tvm.driver.tvmc compile \
  --target "llvm" \
  --output resnet50-v2-7-tvm.tar \
  resnet50-v2-7.onnx

# Run inference
python -m tvm.driver.tvmc run \
  --inputs imagenet_cat.npz \
  --output predictions.npz \
  resnet50-v2-7-tvm.tar
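
The run step above expects an imagenet_cat.npz input file, which the excerpt does not show being created. A minimal way to prepare it is sketched below; the input tensor name "data", the 224x224 size, and the ImageNet normalization are assumptions based on how this ResNet-50 ONNX model is commonly fed, and imagenet_cat.png stands in for any test image.

# Prepare imagenet_cat.npz for the tvmc run step (illustrative sketch)
import numpy as np
from PIL import Image

img = Image.open('imagenet_cat.png').convert('RGB').resize((224, 224))  # any test image will do
img_data = np.asarray(img).astype('float32')
img_data = np.transpose(img_data, (2, 0, 1))   # HWC -> CHW

# Normalize with the usual ImageNet mean / std
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)
img_data = (img_data / 255.0 - mean) / std

# The model is assumed to expect an input named "data" with shape (1, 3, 224, 224)
np.savez('imagenet_cat', data=img_data[np.newaxis, :].astype('float32'))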

Auto‑tuning can further improve performance:

# Auto‑tune
python -m tvm.driver.tvmc tune \
  --target "llvm" \
  --output resnet50-v2-7-autotuner_records.json \
  resnet50-v2-7.onnx

# Compile with tuning records
python -m tvm.driver.tvmc compile \
  --target "llvm" \
  --tuning-records resnet50-v2-7-autotuner_records.json \
  --output resnet50-v2-7-tvm_autotuned.tar \
  resnet50-v2-7.onnx

Benchmarking shows the tuned model runs ~30 % faster on an Intel CPU compared to the untuned version.
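
The timing comparison can be reproduced with tvmc's built-in run statistics; the repeat count below is an arbitrary choice, and the same command can be pointed at the untuned archive to obtain the baseline.

# Benchmark the tuned model (repeat count is an arbitrary choice)
python -m tvm.driver.tvmc run \
  --inputs imagenet_cat.npz \
  --output predictions.npz \
  --print-time \
  --repeat 100 \
  resnet50-v2-7-tvm_autotuned.tar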

Finally, the article notes that WebGPU, the upcoming web graphics API, can bring near‑native GPU performance to browsers. TVM can compile models for WebGPU, enabling high‑performance edge AI in web applications.

Tags: Model Optimization, Edge AI, TensorFlow, JAX, mobile inference, TVM

Written by Alibaba Terminal Technology, the official public account of Alibaba Terminal.