
Building the ATLAS Automated Machine Learning Platform at Du Xiaoman: Architecture, Practices, and Optimizations

This article describes how Du Xiaoman tackled the high cost, instability, and long cycles of AI algorithm deployment by building the ATLAS automated machine learning platform, detailing its four‑stage workflow, component platforms, scaling and efficiency techniques, and practical Q&A for practitioners.

DataFunTalk

Introduction

As AI models grow larger and business scenarios become more diverse, Du Xiaoman faces high development costs, strong reliance on experts, unstable algorithm quality, and long deployment cycles. An automated machine learning (AutoML) platform is presented as the key solution.

Outline

The presentation is organized into four parts: (1) machine-learning platform, (2) AutoML, (3) scale and efficiency, (4) summary and reflections.

Business Scenarios

Du Xiaoman, a fintech company, has three main AI-driven business lines: intelligent risk control (NLP, CV, facial recognition), intelligent acquisition (personalized pricing, recommendation, creative ads), and intelligent operations (graph neural networks, causal inference, OCR). The variety of AI techniques creates significant challenges for algorithm deployment.

AI Deployment Challenges

AI deployment faces an "impossible triangle": cost, efficiency, and quality are difficult to optimize at the same time. Specific pain points are high algorithm development thresholds, heavy hardware resource consumption, results that fluctuate with dependence on individual experts, and development-to-deployment cycles lasting months.

AI Production Process

The end-to-end workflow consists of data management, model training, algorithm optimization, and deployment. Each step demands a different skill set, making it hard for a small team of engineers to master all of the required technologies.

ATLAS Platform Overview

ATLAS spans the full AI production lifecycle and aims to replace manual effort with automation. It comprises four interconnected sub-platforms:

Annotation platform – produces high‑quality labeled data with multi‑scenario coverage and intelligent pre‑labeling/auto‑correction.

Data platform – handles massive data governance, dynamic sample matching, and real‑time queries over billions of user features.

Training platform – organized into five layers (scheduling, control, functional, application, user) and supports AutoML, parallel and graph computation.

Deployment platform – offers server‑less‑like, low‑cost, high‑availability model serving with APIs for feature processing, prediction, and external data access.

Annotation Platform Details

The annotation platform supports diverse tasks (OCR, face detection, image classification, entity extraction) and improves efficiency through intelligent pre-labeling and confidence-based re-labeling.

Data Platform Details

The data platform stores billion-scale user feature data (thousands of features per user), enables dynamic, real-time sample selection, and provides flexible large-scale data governance.

Training Platform Details

The training platform includes a scheduling layer for hardware resources, a control layer for workflow orchestration, a functional layer offering AutoML plus parallel and graph computation, an application layer that packages AI capabilities into pipelines, and a user layer for end-users.

Deployment Platform Details

The deployment platform implements a serverless-like architecture tailored to model serving, exposes three API components (feature processing, model prediction, external data access), and enables new-model rollout within half a day.
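The three API components can be pictured as a composed pipeline. Below is a minimal Python sketch; every function name and data value is invented for illustration, since the talk does not show the platform's real interfaces:

```python
from typing import Any, Dict, List

def fetch_external_data(user_id: str) -> Dict[str, Any]:
    """External data access: look up extra signals for a user (stubbed)."""
    return {"credit_grade": "B"} if user_id else {}

def process_features(raw: Dict[str, Any], external: Dict[str, Any]) -> List[float]:
    """Feature processing: merge raw and external inputs into a vector."""
    grade_map = {"A": 0.9, "B": 0.6, "C": 0.3}
    return [float(raw.get("income", 0.0)) / 1e5,
            grade_map.get(external.get("credit_grade", ""), 0.0)]

def predict(features: List[float]) -> float:
    """Model prediction: a stand-in linear scorer."""
    weights = [0.7, 0.3]
    return sum(w * x for w, x in zip(weights, features))

def serve(user_id: str, raw: Dict[str, Any]) -> float:
    """One request: external data -> feature processing -> prediction."""
    external = fetch_external_data(user_id)
    features = process_features(raw, external)
    return predict(features)

score = serve("u123", {"income": 50000})
```

Keeping the three components behind separate interfaces is what lets a serverless-style scheduler scale and deploy them independently.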

Optimization & Iteration

Two case studies illustrate ATLAS in action: (1) continuous online model iteration for OCR, where bad cases are re-labeled and fed back into AutoML, yielding roughly 1% accuracy improvement; (2) AutoML-guided optimization, where AutoML provides a strong baseline that expert input then refines, delivering 1-5% performance gains in over 60% of internal scenarios.
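The OCR feedback loop hinges on routing low-confidence predictions back to annotators. A toy sketch of that selection step, assuming a simple per-sample confidence score (the threshold and data are invented, not from the talk):

```python
# Predictions below the confidence threshold are sent for human
# re-labeling; corrected samples then re-enter the AutoML training set.

def select_for_relabeling(predictions, threshold=0.8):
    """Split predictions into auto-accepted and to-be-relabeled sets."""
    accepted, relabel = [], []
    for sample_id, text, confidence in predictions:
        bucket = accepted if confidence >= threshold else relabel
        bucket.append((sample_id, text, confidence))
    return accepted, relabel

preds = [("img1", "invoice", 0.97),
         ("img2", "1nvoice", 0.55),   # likely bad case
         ("img3", "total",   0.91)]
accepted, relabel = select_for_relabeling(preds)
# Only img2 lands in relabel and goes back to the annotation platform.
```

The same confidence-based routing appears in the annotation platform's pre-labeling stage, which is what makes the loop cheap to run continuously.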

Scale & Efficiency

Model parameter counts grow far faster than hardware capacity, so the platform attacks deep-learning scale on several fronts:

Data parallelism – transparent to the user, supports both neural networks and boosting models, with near-linear throughput gains.

Model parallelism – layer-wise splitting combined with pipeline parallelism.

Graph parallelism – O(1) memory mapping that enables billion-node graph computation.

GPU utilization – scheduling improvements that roughly double average usage.

Backward-pass recomputation – cuts training memory by more than 50% and speeds training by more than 35%.
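The data-parallelism point can be made concrete with a toy synchronous setup: each worker computes a gradient on its own shard, the gradients are averaged (the all-reduce step that real frameworks perform over the network), and every replica applies the identical update. This is a pure-Python sketch of the technique, not the platform's implementation:

```python
# Synchronous data parallelism on a 1-parameter least-squares model y = w * x.

def shard(data, num_workers):
    """Split the dataset round-robin across workers."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, batch):
    """Per-worker gradient of mean squared error on its shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    """Stand-in for the all-reduce collective: average the gradients."""
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.01):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in practice
    g = all_reduce_mean(grads)                      # the synchronization point
    return w - lr * g                               # identical update on every replica

data = [(x, 3.0 * x) for x in range(1, 9)]  # ground truth: w = 3
w = 0.0
for _ in range(200):
    w = train_step(w, shard(data, num_workers=4))
```

Because only gradients cross worker boundaries, the scheme is transparent to the model code, which is why it extends to boosting models as well as neural networks.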

Meta-Learning & NAS

Meta-learning leverages hyperparameters from historical tasks to accelerate new ones, achieving faster convergence and roughly 1% extra accuracy. Neural architecture search (NAS) uses a one-shot, weight-entanglement method that is about three times faster than DARTS and supports search spaces such as MobileNet and ResNet.
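One simple way to realize the meta-learning idea is a nearest-neighbor warm start over task meta-features: the new task inherits the best-known hyperparameters of the most similar historical task. The sketch below uses invented task descriptors and history, not Du Xiaoman's actual meta-data:

```python
import math

history = {
    # task meta-features (n_samples, n_features) -> best hyperparameters found
    (10_000, 50):     {"lr": 0.01,  "depth": 6},
    (1_000_000, 200): {"lr": 0.001, "depth": 10},
}

def similarity(a, b):
    """Negative Euclidean distance between meta-features in log space."""
    return -math.dist([math.log(v) for v in a], [math.log(v) for v in b])

def warm_start(new_task):
    """Reuse the config of the historical task closest to the new one."""
    best = max(history, key=lambda t: similarity(t, new_task))
    return dict(history[best])

config = warm_start((20_000, 60))  # closest to the (10_000, 50) task
```

The warm-start config seeds the hyperparameter search rather than replacing it, which is how historical tasks buy faster convergence without constraining the final result.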

Future Outlook

Plans include extending ATLAS to more scenarios, exploring 3D-parallel training for massive language models, and further narrowing the gap between industry-leading AI algorithms and internal capabilities.

Q&A Highlights

Recommended open-source AutoML frameworks: Optuna (preferred for high-dimensional search spaces), Auto-sklearn, and Auto-WEKA; automl.org is a useful reference site.

Platform development took 3‑4 years with a core team of 6‑7 engineers.

AutoML aims to finish a full optimization cycle within a day, using modest additional compute (2‑3× the baseline training cost).

Virtualized GPU resources are essential for fine‑grained scheduling; current strategy mixes high‑ and low‑intensity tasks for time‑share reuse.

Multi‑node distributed training can achieve 80‑90% linear speed‑up up to 16 GPUs in typical workloads.
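The quoted 80-90% figure is scaling efficiency: the measured speedup divided by the ideal linear speedup. A quick arithmetic check with illustrative timings:

```python
def scaling_efficiency(t_single, t_multi, n_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    speedup = t_single / t_multi
    return speedup / n_gpus

# Example: a job taking 160 min on 1 GPU finishes in 12.5 min on 16 GPUs,
# a 12.8x speedup against an ideal of 16x.
eff = scaling_efficiency(160.0, 12.5, 16)
```

Here `eff` is 0.8, i.e. 80% of linear scaling, the low end of the quoted range.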

In summary, the ATLAS platform demonstrates that a well‑designed AutoML system can dramatically reduce AI development cost, improve stability, and accelerate deployment across diverse fintech use cases.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
