Kunpeng: A Scalable Distributed Machine Learning Platform for Billion‑Scale Data
Kunpeng is a unique distributed platform that seamlessly integrates large‑scale system architecture with parallel optimization algorithms, delivering fault‑tolerant, high‑performance machine‑learning capabilities for billions of samples and features, and outperforming Spark, MPI, and XGBoost in real‑world Alibaba applications.
Research Background
In the era of big data, platforms handle massive volumes such as 500 million tweets per day, 50 million parcels, and 120,000 transactions per second. Alibaba’s internal models can reach billions of samples and hundreds of billions of features, creating a need for a distributed machine learning platform capable of training at this scale.
Existing Methods
Current distributed frameworks like Hadoop, Spark, GraphLab, and GraphX support parallel machine learning but struggle with large‑scale algorithms. Spark and Hadoop provide coarse‑grained operators, while GraphLab/GraphX focus on graph computation and lack suitability for general ML. MPI offers low‑level parallelism but lacks fault tolerance, especially with many workers.
Motivation and Innovation of Kunpeng
Named after the mythic fish and bird, Kunpeng combines a massive distributed computing system (“Kun”) with a large‑scale distributed optimization algorithm (“Peng”). This integration enables “one‑fly‑to‑the‑sky” performance.
System Innovations
Robust fault‑tolerance even in busy online clusters
Backup Instance for Straggler Management
Support for DAG‑based scheduling and synchronization (BSP/SSP/ASP)
User‑friendly interfaces and programming APIs
Algorithm Innovations
Kunpeng enables large‑scale implementations of many algorithms, including LR, FTRL, MART, FM, HashMF, DSSM, DNN, and LDA.
Kunpeng Architecture
The architecture builds on Alibaba’s internal Apasra platform and includes features such as Robust Failover, Backup Instance, and DGA for scheduling and synchronization.
Core modules:
Server nodes: shard model storage
Worker nodes: shard training data and compute
Coordinator: controls algorithm flow (initialization, iteration, termination)
ML Bridge: script‑based data preprocessing
PS‑Core: parameter server (servers, workers, coordinator)
Fuxi: monitors machine status and performs fault recovery
User Perspective
Users invoke algorithms with a few script commands, e.g.,
ps_train -i demo_batch_input -o demo_batch_result -a xxAlgo -t, followed by termination and evaluation steps. The process is transparent and smooth.
Developer Perspective
Developers use simple APIs such as Worker.PullFrom(Server) and SyncBarrier() to handle communication and synchronization.
Experimental Results
Comparison with Spark and MPI
On seven datasets, Kunpeng’s logistic regression (LR) training time and memory consumption outperform Spark and MPI, especially on high‑dimensional data where Spark and MPI fail.
Kunpeng‑MART vs XGBoost
Kunpeng‑MART uses less memory than XGBoost across four datasets and achieves faster training times; XGBoost often fails on large datasets.
Impact of Worker Count
Increasing workers accelerates training and reduces per‑worker memory usage. With 25 workers, Kunpeng trains a sparse LR model on 70 billion features and 180 billion samples.
Conclusion
Kunpeng provides powerful distributed computation and algorithm optimization, handling billions of features, trillions of parameters, and delivering robustness, flexibility, scalability, and efficiency in production. It boosted CTR by 21% and GMV by 10% during Alibaba’s 2015 Double‑11 event, and improved fraud‑detection recall from 91% to 98% in Alipay.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
