Walle: An End‑to‑End, General‑Purpose, Scalable Edge‑Cloud Collaborative Machine Learning System
The article introduces Walle, the edge‑cloud collaborative machine‑learning platform that Alibaba has developed over four years. Walle unifies a compute container, a data pipeline, and a deployment platform to deliver low‑latency, privacy‑preserving, high‑throughput AI services across billions of mobile devices; the article covers its architecture, design challenges, and evaluation results.
This article presents Walle, a four‑year‑long research effort by Alibaba's Mobile Taobao Meta team to build a universal, scalable edge‑cloud collaborative machine‑learning system.
Background & Motivation: Rapid advances in mobile hardware and AI have enabled rich on‑device services, but cloud‑centric pipelines suffer from high latency, high cost, and privacy risks. Leveraging billions of devices requires moving part of the ML workload to the edge.
Overall Goal & Architecture: Walle follows end‑to‑end, general‑purpose, and industrial‑scale design principles, supporting any ML stage on either side. Its architecture comprises three core modules: a compute container, a data pipeline, and a deployment platform, all orchestrated to enable flexible task placement.
Compute Container: Provides a lightweight, cross‑platform execution environment using the MNN deep‑learning framework and a custom Python thread‑level VM that removes the GIL and supports task‑level multithreading. It features semi‑automatic operator search, raster operators for geometric computation, and hand‑optimized kernels for heterogeneous back‑ends.
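The semi‑automatic operator search mentioned above amounts to auto‑tuning: timing candidate kernel implementations on the target device and keeping the fastest. The sketch below is a hypothetical, simplified illustration of that idea in plain Python (the function and kernel names are invented for this example, not MNN's actual API):

```python
import time

def tune_operator(candidates, args, trials=5):
    """Pick the fastest implementation of an operator by timing each
    candidate on representative inputs (a simplified auto-tuning sketch)."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(trials):
            fn(*args)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Two hypothetical matmul kernels for a toy 2x2 case.
def matmul_naive(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_unrolled(a, b):
    # Hand-"optimized" variant specialized to 2x2 inputs only.
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
best = tune_operator({"naive": matmul_naive, "unrolled": matmul_unrolled}, (a, b))
```

In a real system the winning kernel would be cached per device and operator shape, so the search cost is paid once rather than on every inference.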
Data Pipeline: Introduces a novel on‑device stream‑processing framework with trie‑based trigger management, enabling stateful processing of user behavior streams. A real‑time, SSL‑optimized long‑connection channel transfers ~30 KB of data within 500 ms, reducing cloud load and preserving privacy.
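Trie‑based trigger management can be pictured as registering each trigger's event pattern as a path in a trie, then walking the trie as the on‑device behavior stream arrives. The following is a minimal hypothetical sketch of that data structure (class, method, and task names are invented for illustration, not Walle's actual interface):

```python
class TriggerTrie:
    """Minimal trie keyed on event names: a task id stored at a node fires
    when the incoming event sequence reaches that node."""
    def __init__(self):
        self.children = {}
        self.task = None

    def register(self, pattern, task):
        """Store a task at the node reached by following the event pattern."""
        node = self
        for event in pattern:
            node = node.children.setdefault(event, TriggerTrie())
        node.task = task

    def match(self, events):
        """Return the tasks triggered by prefixes of the event stream."""
        fired, node = [], self
        for event in events:
            node = node.children.get(event)
            if node is None:
                break
            if node.task is not None:
                fired.append(node.task)
        return fired

trie = TriggerTrie()
trie.register(["enter_page", "scroll"], "update_interest_features")
trie.register(["enter_page", "click_item"], "rerank_candidates")
fired = trie.match(["enter_page", "click_item", "leave_page"])  # → ["rerank_candidates"]
```

Because patterns sharing a prefix share trie nodes, matching a new event is a single dictionary lookup regardless of how many triggers are registered, which keeps stateful stream processing cheap on-device.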
Deployment Platform: Handles day‑level task iteration for hundreds of ML tasks across billions of devices. It supports unified and personalized deployment strategies, short‑connection push‑pull mechanisms, multi‑batch releases, and rollback safety, ensuring robustness at massive scale.
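The multi‑batch release with rollback safety described above can be sketched as a staged rollout: push to progressively larger device fractions and abort as soon as a health check fails. This is a hypothetical toy model of the control flow (all names and the batch fractions are assumptions, not Walle's actual mechanism):

```python
def multi_batch_release(devices, batch_fracs=(0.01, 0.10, 1.0),
                        healthy=lambda d: True):
    """Roll a task out in increasing batches; stop and report rollback
    if any device in the current batch fails its health check."""
    released = []
    for frac in batch_fracs:
        target = int(len(devices) * frac)
        # Devices in this batch that have not yet received the task.
        batch = [d for d in devices[:target] if d not in released]
        for d in batch:
            if not healthy(d):
                # In a real system the released set would now be reverted.
                return released, "rolled_back"
            released.append(d)
    return released, "completed"

devices = [f"device-{i}" for i in range(100)]
released, status = multi_batch_release(devices)
```

For example, `multi_batch_release(devices, healthy=lambda d: d != "device-5")` halts in the second batch and reports `"rolled_back"`, so a bad task never reaches the full fleet.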
Evaluation Results: In e‑commerce live‑streaming and recommendation scenarios, Walle cuts cloud load by 87%, increases video coverage by 123%, and brings per‑device latency under 150 ms. The new data pipeline reduces IPV (item page view) feature latency to 44 ms and cuts data volume by over 90%. Benchmarks show MNN outperforms TensorFlow Lite and PyTorch Mobile on most hardware, and the thread‑level VM achieves significant speedups over CPython.
Summary: Walle is the first end‑to‑end, general‑purpose, industrial‑scale edge‑cloud system, combining a high‑performance compute container, an on‑device data pipeline, and a robust deployment platform to deliver fast, privacy‑aware AI services across billions of devices.
Q&A Highlights: The system impacts Alibaba’s recommendation and live‑stream metrics, its compute container is built on open‑source MNN 2.0, it differs from federated learning by operating at the system level, and over 1,000 models have been deployed, with ~300 active today.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.