Building the ATLAS Automated Machine Learning Platform at Du Xiaoman: Architecture, Optimization, and Practical Insights
This article details Du Xiaoman's development of the ATLAS automated machine learning platform: the business scenarios it serves, the challenges of deploying AI algorithms, the end‑to‑end production workflow, and the platform's annotation, data, training, and deployment components. It also covers optimization techniques such as AutoML, meta‑learning, NAS, and large‑scale parallelism, and closes with lessons learned and future directions.
With the rapid growth of AI technologies and exploding model parameter sizes, Du Xiaoman faces high development costs, strong reliance on experts, algorithm instability, and long deployment cycles, prompting the need for an automated machine learning platform.
The ATLAS platform addresses these challenges by providing a full‑stack solution that spans data management, model training, algorithm optimization, and deployment, reducing manual effort and accelerating AI production.
Key business scenarios include intelligent risk control (NLP and CV), intelligent acquisition (personalized pricing and recommendation), and intelligent operations (graph neural networks, causal inference, OCR).
AI algorithm deployment suffers from a "trilemma" of high cost, low efficiency, and unstable quality; ATLAS mitigates this by automating data labeling, feature engineering, model selection, training, and deployment.
ATLAS consists of four tightly integrated platforms:
- a labeling platform that produces high‑quality training data;
- a data platform for large‑scale data governance and dynamic sample matching;
- a training platform organized into scheduling, control, functional, application, and user layers;
- a deployment platform offering serverless‑style, low‑cost, high‑availability model serving.
Optimization techniques include:
- AutoML pipelines that combine hyper‑parameter search, meta‑learning, and neural architecture search to reduce development cycles from months to days.
- Meta‑learning that leverages historical tasks to guide hyper‑parameter choices for new tasks, improving convergence speed and model performance.
- One‑shot NAS with weight entanglement, achieving 3× faster search than DARTS while keeping model size and compute under control.
- Data parallelism supporting both deep learning and boosting models, providing near‑linear throughput gains.
- Model parallelism (layer‑wise and pipeline) for extremely large fully‑connected layers.
- Graph parallelism with O(1) memory mapping to handle billion‑node graphs.
- Training efficiency improvements such as GPU utilization optimization and checkpoint‑based recomputation, cutting memory usage by over 50% and speeding up training by more than 35%.
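The checkpoint‑based recomputation mentioned above can be illustrated with a minimal, framework‑free sketch (the helper names and toy "layers" are illustrative, not ATLAS internals): instead of caching every intermediate activation for the backward pass, store only every k‑th one and recompute the rest on demand, trading extra forward compute for a large cut in activation memory.

```python
# Sketch of checkpoint-based recomputation: cache only every k-th
# activation during the forward pass, then rebuild any intermediate
# activation from the nearest checkpoint when it is needed again.

def run_with_checkpoints(layers, x, k=4):
    """Forward pass that stores only every k-th activation."""
    checkpoints = {0: x}              # layer index -> cached activation
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        if i % k == 0:
            checkpoints[i] = x
    return x, checkpoints

def activation_at(layers, checkpoints, i, k=4):
    """Recompute the activation after layer i from the nearest checkpoint."""
    start = (i // k) * k              # nearest stored checkpoint at or before i
    x = checkpoints[start]
    for layer in layers[start:i]:     # replay the short segment
        x = layer(x)
    return x

# 16 toy "layers"; layer j simply adds j to its input.
layers = [lambda v, j=j: v + j for j in range(1, 17)]
out, ckpts = run_with_checkpoints(layers, 0, k=4)

print(len(ckpts))                                  # 5 cached values, not 17
print(activation_at(layers, ckpts, 10))            # same as an uncheckpointed run
```

With k=4 the forward pass keeps 5 cached values instead of 17, yet any intermediate activation is still recoverable by replaying at most k−1 layers, which is the memory‑for‑compute trade the article's 50% figure refers to.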
The platform has been deployed for OCR, face recognition, and other CV/NLP tasks, achieving 1‑5% accuracy gains and near‑linear scaling across multiple GPUs.
In summary, ATLAS demonstrates that a well‑designed machine learning platform combined with AutoML is essential for reducing AI deployment costs, improving stability, and meeting large‑scale efficiency requirements.
Future work includes extending ATLAS to more scenarios, exploring 3D parallelism for massive language models, and further narrowing the gap between industry‑leading AI algorithms and internal capabilities.
The article concludes with a Q&A covering open‑source AutoML frameworks, development timelines, resource virtualization, and performance scaling.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.