Running a CNN on Mobile: TensorFlow & OpenCV Document Detection Guide
This article walks through a real‑world mobile implementation of a convolutional neural network for document detection, covering problem definition, limitations of traditional OpenCV pipelines, the adoption of a HED edge‑detection network, data preparation, model training, TensorFlow library trimming, and deployment tricks for iOS and Android.
1. Introduction
This piece is not a beginner tutorial on neural networks or machine learning; it demonstrates key techniques for running a CNN on a mobile device using a real product case.
2. Requirement
The goal is to locate the four corner coordinates of a rectangular document within an image.
3. Traditional Technical Solutions
Typical OpenCV tutorials rely on cv2.Canny() and cv2.findContours(), but they work only on ideal demo images. Real‑world photos contain noise, broken edges, and non‑rectangular contours, requiring extensive tuning.
4. Limitations of Traditional Methods
Edge detection depends on manually set thresholds, reducing robustness.
The geometric model that turns the edge map into a rectangle is complex and often fails when edges are broken or irregular.
5. Rethinking the Approach
After exhausting traditional tuning, the team turned to machine learning to improve the two critical steps: edge detection and rectangle extraction.
6. Ineffective Neural‑Network Attempts
6.1 End‑to‑End Regression
Directly regressing the four corner coordinates from the image failed; the task turned out to be poorly suited to plain end-to-end regression.
6.2 YOLO & FCN
YOLO for object detection and FCN for semantic segmentation did not achieve the required precision and were too heavy for real‑time mobile inference.
7. Effective Neural‑Network Solution
The team replaced the Canny step with a neural network that performs edge detection, simplifying the subsequent geometric algorithm.
7.1 Network Input/Output
The network takes an image and outputs an enhanced edge map suitable for rectangle extraction.
7.2 HED (Holistically‑Nested Edge Detection) Network
HED, built on VGG16, uses multi‑scale feature fusion. Unnecessary fully‑connected and softmax layers are removed, and only the five convolutional groups are retained.
8. Training the Network
8.1 Loss Function
Initially every scale's side output contributed to the loss; later only the fused final output was used, which produced thinner edges.
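The two loss variants can be sketched in NumPy as follows. The article does not state which per-pixel loss was used; this sketch assumes the class-balanced sigmoid cross-entropy described in the HED paper, and the names class_balanced_bce, total_loss, and fused_only are illustrative.

```python
import numpy as np

def class_balanced_bce(logits, labels):
    """Class-balanced sigmoid cross-entropy (assumed loss, following the HED paper)."""
    labels = labels.astype(np.float64)
    beta = 1.0 - labels.mean()  # fraction of non-edge pixels; up-weights the rare edge class
    # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x)).
    log_p_edge = -np.logaddexp(0.0, -logits)
    log_p_background = -np.logaddexp(0.0, logits)
    per_pixel = -(beta * labels * log_p_edge
                  + (1.0 - beta) * (1.0 - labels) * log_p_background)
    return per_pixel.mean()

def total_loss(side_logits, fused_logits, labels, fused_only=True):
    """fused_only=False sums every scale; fused_only=True keeps only the fused output."""
    loss = class_balanced_bce(fused_logits, labels)
    if not fused_only:
        loss += sum(class_balanced_bce(side, labels) for side in side_logits)
    return loss
```

Switching fused_only from False to True corresponds to the change described above, where the side outputs stop being supervised directly.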
8.2 Transposed Convolution Initialization
Bilinear up‑sampling kernels and a small learning rate were employed to aid convergence.
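A bilinear initializer for a transposed-convolution filter can be built as in the sketch below. The weight layout (height, width, out_channels, in_channels) follows what tf.nn.conv2d_transpose expects; the function names and the choice of one kernel per channel are assumptions for illustration.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size)."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - np.abs(rows - center) / factor)
            * (1 - np.abs(cols - center) / factor))

def bilinear_deconv_weights(size, channels):
    """Initial filter of shape (size, size, channels, channels): each channel is upsampled independently."""
    weights = np.zeros((size, size, channels, channels), dtype=np.float32)
    kernel = bilinear_kernel(size)
    for c in range(channels):
        weights[:, :, c, c] = kernel
    return weights
```

For 2x up-sampling a 4x4 kernel is the usual choice; starting from these weights the layer behaves like plain bilinear interpolation before training adjusts it.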
8.3 Cold‑Start Training
Training began with a small sample set (≈2000 images) for a few thousand iterations; if convergence was not observed, the run was aborted and restarted.
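The abort-and-restart loop might be automated along these lines. This is a sketch under stated assumptions: run_trial is a hypothetical callback that trains for a few thousand iterations with a given seed and returns the recorded loss values, and the 20% drop threshold is an arbitrary stand-in for "convergence was observed".

```python
def train_with_restarts(run_trial, max_restarts=5, min_relative_drop=0.2):
    """Repeat short trial runs until the loss clearly decreases.

    run_trial(seed) is a hypothetical callback returning a list of loss values.
    Returns (seed, losses) for the first run that converges, or None.
    """
    for seed in range(max_restarts):
        losses = run_trial(seed)
        window = max(1, len(losses) // 10)
        start = sum(losses[:window]) / window   # average loss early in the run
        end = sum(losses[-window:]) / window    # average loss at the end of the run
        if end <= (1.0 - min_relative_drop) * start:
            return seed, losses
    return None
```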
9. Training Dataset
Both synthetic (≈80,000 images) and manually annotated real images (≈1,200) were used to cover diverse perspectives and backgrounds.
10. Running TensorFlow on Mobile Devices
10.1 Using TensorFlow Libraries
TensorFlow can be built for both iOS and Android, but protobuf version conflicts may require namespace adjustments or manual library patches.
10.2 Deploying the Trained Model
After training, the checkpoint is converted to a frozen .pb file, which can be loaded directly via the TensorFlow C++ API on mobile.
11. Debugging a Crash
A crash reporting a missing Mul operation was traced to TensorFlow ops that are not supported on mobile; the offending deconvolution code was rewritten to avoid tf.shape and tf.pack.
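One way to avoid dynamic shape ops, sketched here as an assumption rather than the author's actual rewrite, is to compute the deconvolution output shape in plain Python at graph-construction time. This requires a fixed input size; the helper name, the NHWC layout, and SAME padding are all assumed.

```python
def deconv_output_shape(input_shape, stride, out_channels):
    """Static output shape for a transposed convolution with SAME padding.

    input_shape is a Python list [batch, height, width, channels] (NHWC, assumed)
    with every dimension known up front.
    """
    batch, height, width, _ = input_shape
    return [batch, height * stride, width * stride, out_channels]

# The resulting list could be passed as the output_shape argument of
# tf.nn.conv2d_transpose, so no tf.shape or tf.pack node ends up in the graph.
shape = deconv_output_shape([1, 32, 32, 64], stride=8, out_channels=1)
```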
12. Trimming TensorFlow
Only the required ops (46 out of 200+) were kept in tf_op_files.txt, reducing the library size dramatically.
13. Model Pruning
By reducing the number of filters in each VGG group, the model size dropped from 56 MB to 4.2 MB while maintaining ~0.1 s per‑frame inference on an iPhone 7 Plus.
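A quick parameter count shows why narrowing the filters shrinks the model so sharply. The sketch counts only the 3x3 convolutions of the five VGG16 groups stored as float32; the slimmed widths below are hypothetical, since the article does not list the ones actually used.

```python
VGG16_LAYERS_PER_GROUP = (2, 2, 3, 3, 3)

def conv_backbone_params(widths, layers_per_group=VGG16_LAYERS_PER_GROUP,
                         in_channels=3, kernel=3):
    """Count weights and biases of the stacked 3x3 conv layers."""
    total = 0
    for width, layers in zip(widths, layers_per_group):
        for _ in range(layers):
            total += kernel * kernel * in_channels * width + width
            in_channels = width
    return total

def size_mib(params, bytes_per_param=4):
    """Size in MiB assuming float32 storage."""
    return params * bytes_per_param / (1024 ** 2)

full = conv_backbone_params((64, 128, 256, 512, 512))   # standard VGG16 widths
slim = conv_backbone_params((16, 32, 64, 128, 128))     # hypothetical reduced widths
print(full, round(size_mib(full), 1))   # 14714688 56.1
print(slim, round(size_mib(slim), 1))
```

The unpruned count comes to about 56 MiB, in line with the starting size quoted above. Because each layer's parameter count scales with the product of its input and output widths, quartering every width cuts the total by roughly a factor of 16.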
14. Choosing TensorFlow APIs
Higher‑level APIs such as TensorFlow‑Slim were used to improve code readability and reuse.
15. Complementary OpenCV Algorithm
After HED edge detection, HoughLinesP extracts line segments, which are extended, merged, and filtered to compute intersection points and finally select the best rectangle.
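The intersection step reduces to intersecting the infinite lines through two segments. A minimal pure-Python sketch, assuming segments in the (x1, y1, x2, y2) form that HoughLinesP produces; the function name and the eps tolerance are illustrative.

```python
def line_intersection(seg_a, seg_b, eps=1e-9):
    """Intersect the infinite lines through two segments (x1, y1, x2, y2).

    Returns (x, y), or None when the lines are parallel.
    """
    x1, y1, x2, y2 = seg_a
    x3, y3, x4, y4 = seg_b
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < eps:
        return None  # parallel or coincident lines never give a corner
    det_a = x1 * y2 - y1 * x2
    det_b = x3 * y4 - y3 * x4
    px = (det_a * (x3 - x4) - (x1 - x2) * det_b) / denom
    py = (det_a * (y3 - y4) - (y1 - y2) * det_b) / denom
    return px, py

# A horizontal and a vertical segment that do not touch still yield a corner
# once both are extended.
corner = line_intersection((0, 0, 1, 0), (2, -1, 2, 1))  # (2.0, 0.0)
```

Working with extended lines rather than the raw segments is what lets the algorithm recover corners that are occluded or missing from the edge map.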
16. Summary
Algorithm Perspective
Parameter tuning is largely empirical.
Neural‑network development is an experimental science.
Labeling data is costly and often a bottleneck.
Balancing accuracy, model size, and speed is essential.
Engineering Perspective
When an end-to-end network fails, a pipeline that applies a network to one targeted step can still work.
Master at least one deep‑learning framework and maintain high code quality.
Learn core patterns and adapt them across problems.
Bridge academic advances with practical engineering constraints.
