Alibaba Cloud Native Deep Learning Platform PAI‑DLC: Architecture, Features, and Future Outlook
This article introduces PAI‑DLC, Alibaba Cloud's cloud‑native deep learning platform. It covers the platform's machine‑learning capabilities, containerized services, AI‑aware scheduling, GPU virtualization, EasyScale elastic training, data access, and observability, and discusses its architecture, key features, and future directions.
Introduction – With the rapid development of deep learning, building robust AI infrastructure is essential for industrial intelligence, making deep learning platforms a foundational technology. This talk shares the practice and implementation of Alibaba Cloud's cloud‑native deep learning platform PAI‑DLC.
Agenda – The presentation covers three parts: (1) Machine‑learning platform overview, (2) DLC architecture design, and (3) Future outlook.
1. Machine‑learning platform overview
The platform provides four core capabilities: data processing, model development, model training, and model deployment.
Data processing includes data preprocessing and feature engineering.
Model development supports traditional algorithms (e.g., XGBoost, SVM) and deep‑learning frameworks, with interactive tools such as Jupyter Notebook/WebIDE.
Model training leverages frameworks like PyTorch, TensorFlow, PaddlePaddle, and various algorithm engines, handling data storage, caching, and heterogeneous hardware (CPU, GPU, DCU).
Model deployment offers serving frameworks (e.g., ONNX Runtime, TensorFlow Serving) and custom inference engines, often accelerated by heterogeneous hardware.
2. Machine‑learning platform architecture
The PAI platform follows a PaaS model built on IaaS resources (CPU, GPU, FPGA, NPU) and Alibaba Cloud Container Service (ACK). It integrates open‑source frameworks and Alibaba‑specific Easy series (EasyNLP, EasyRec, EasyCV, etc.), providing data labeling, visual modeling, and interactive modeling, with PAI‑DLC as the core training engine.
Deep learning platform characteristics – Efficiency, user and resource management, cost saving, elasticity, reproducibility, heterogeneous compute, data autonomy, and AutoML.
PAI‑DLC architecture
The stack consists of a hardware‑infrastructure layer (CPU, GPU/DCU, FPGA, RDMA, NAS/OSS), a Kubernetes layer with custom plugins, a PAI control plane built on K8s CRDs (including KubeDL Operator for TF, PyTorch, MPI, XGBoost jobs), DLC services (authentication, OpenAPI, CLI/SDK), and higher‑level SaaS applications.
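To make the control‑plane layer concrete, here is a minimal sketch of what a KubeDL‑style distributed PyTorch job manifest might look like, expressed as a plain Python dict. The field layout follows the community PyTorchJob CRD convention; the API group/version string, image name, and resource values are illustrative assumptions, not PAI‑DLC's actual manifest.

```python
# Sketch of a distributed PyTorch training job manifest as a Python dict.
# Field layout follows the community PyTorchJob CRD convention;
# apiVersion, image, and resource values are hypothetical.
def make_pytorch_job(name, workers, image):
    def replica(n):
        return {
            "replicas": n,
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }]
                }
            },
        }
    return {
        "apiVersion": "training.kubedl.io/v1alpha1",  # assumed group/version
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }

job = make_pytorch_job("demo-train", workers=3, image="example.com/train:latest")
```

The operator watches objects of this shape and reconciles the master/worker pods, which is what lets DLC expose a single job abstraction across TF, PyTorch, MPI, and XGBoost workloads.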
Key capabilities
Containerization – Both service and engine containers are provided; CI/CD pipelines are built with Argo for image building and deployment.
Open API – Unified resource and permission abstraction, enabling SDK generation and third‑party integration.
AI workload scheduling – Uses custom scheduling built on Volcano and the Kubernetes Scheduler Framework, with coscheduling (gang scheduling) to avoid resource deadlocks between distributed jobs and to improve utilization.
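The deadlock that coscheduling prevents arises when two distributed jobs each grab part of the GPUs and neither can assemble a full worker set. A minimal sketch of the all‑or‑nothing admission idea (job names and GPU counts are made up):

```python
# Sketch of gang (co)scheduling: a job's pods are admitted only if the
# cluster can host ALL of them at once; otherwise none are placed.
# This avoids the deadlock where two jobs each hold half the GPUs
# and neither can start.

def gang_schedule(jobs, free_gpus):
    """jobs: list of (name, gpus_per_worker, workers). Returns placed jobs."""
    placed = []
    for name, per_worker, workers in jobs:
        need = per_worker * workers
        if need <= free_gpus:          # all-or-nothing admission
            free_gpus -= need
            placed.append(name)
        # else: job stays queued; no partial placement
    return placed, free_gpus

placed, left = gang_schedule(
    [("job-a", 1, 4), ("job-b", 2, 4), ("job-c", 1, 2)], free_gpus=8)
print(placed, left)  # ['job-a', 'job-c'] 2 — job-b (needs 8) waits
```

Real gang schedulers also handle queues, priorities, and preemption, but the admission rule above is the core invariant.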
GPU virtualization & sharing – Alibaba‑developed solution offers fine‑grained memory and compute isolation (down to 1 % or 0.1 GPU), supports multiple GPU models, and enables shared scheduling with pod‑level quota accounting.
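The pod‑level quota accounting mentioned above can be pictured as first‑fit packing of fractional GPU requests onto physical cards. This is an illustrative sketch of the accounting, not the actual isolation mechanism (which enforces memory and compute limits below the scheduler):

```python
# Sketch of pod-level quota accounting for shared GPUs: each pod requests
# a fraction of a card (e.g. 0.1 GPU) and the scheduler packs pods onto
# cards until capacity is exhausted. Fractions and counts are illustrative.

def pack_pods(requests, gpus, capacity=1.0):
    """requests: list of (pod, fraction). Returns {gpu_index: [pods]}."""
    load = [0.0] * gpus
    placement = {i: [] for i in range(gpus)}
    for pod, frac in requests:
        for i in range(gpus):
            if load[i] + frac <= capacity + 1e-9:  # first-fit packing
                load[i] += frac
                placement[i].append(pod)
                break
    return placement

plan = pack_pods([("p1", 0.5), ("p2", 0.5), ("p3", 0.25)], gpus=2)
print(plan)  # {0: ['p1', 'p2'], 1: ['p3']}
```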
EasyScale elastic training – A PyTorch‑based framework that preserves hyper‑parameters during elastic scaling by using “EasyScale threads” to multiplex GPU time while maintaining accuracy, also supporting heterogeneous GPU clusters.
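The accuracy‑preserving idea can be illustrated as follows: when physical workers scale from 8 to 4, each GPU time‑multiplexes two logical workers, so the effective global batch size (and hence learning‑rate schedule, normalization statistics, etc.) is unchanged. This is a sketch of that invariant, not EasyScale's actual implementation:

```python
# Illustrative sketch (not EasyScale's code): keep the effective global
# batch size constant across scaling events by having each physical GPU
# emulate a fixed set of logical workers ("EasyScale threads").

def threads_per_gpu(logical_workers, physical_workers):
    """How many logical workers each physical GPU must time-multiplex."""
    assert logical_workers % physical_workers == 0, "must divide evenly"
    return logical_workers // physical_workers

def effective_batch(per_worker_batch, logical_workers):
    # Unchanged across scaling, so hyper-parameters stay valid.
    return per_worker_batch * logical_workers

before = effective_batch(32, 8)  # 8 GPUs, 1 thread each
after = effective_batch(32, 8)   # 4 GPUs, 2 threads each: same logical view
print(threads_per_gpu(8, 4))     # 2 threads per GPU after scale-down
```

Because the logical worker count is fixed, training results stay bitwise‑comparable in expectation even as physical resources (or GPU models, in heterogeneous clusters) change underneath.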
Data access – Supports OSS, NAS, CPFS, local and PVC mounts, data caching with Fluid, data‑affinity scheduling, and strict data isolation for enterprise scenarios.
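Data‑affinity scheduling means preferring nodes that already hold a cached replica of the dataset (Fluid surfaces this by marking such nodes). A minimal sketch of the node‑selection preference, with made‑up node names:

```python
# Sketch of data-affinity scheduling: prefer nodes where the dataset is
# already cached; fall back to any node (cold remote read) otherwise.
# Node names and dataset names are illustrative.

def pick_node(nodes, dataset):
    """nodes: {name: set of cached datasets}. Prefer a cache hit."""
    cached = [n for n, ds in nodes.items() if dataset in ds]
    if cached:
        return sorted(cached)[0]   # deterministic pick among cache hits
    return sorted(nodes)[0]        # no replica anywhere: cold read

nodes = {"node-a": {"imagenet"}, "node-b": set(), "node-c": {"imagenet"}}
print(pick_node(nodes, "imagenet"))  # node-a (cache hit)
print(pick_node(nodes, "coco"))      # node-a (cold fallback)
```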
Observability – Integrates Alibaba SLS and open‑source ELK for log/event collection, exporters (Node, GPU, RDMA) for metrics, and visual dashboards (including TensorBoard) for training monitoring.
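As a toy illustration of the metrics path, an exporter typically aggregates raw per‑device samples before exposing them to a dashboard. The sample values below are invented; real GPU exporters read utilization from the driver:

```python
# Sketch of aggregating per-GPU utilization samples, as a GPU exporter
# might before exposing per-card and fleet-wide metrics. Values are made up.

def summarize(samples):
    """samples: {gpu_id: [utilization %, ...]} -> (per-GPU mean, fleet mean)."""
    per_gpu = {g: sum(v) / len(v) for g, v in samples.items() if v}
    fleet = sum(per_gpu.values()) / len(per_gpu)
    return per_gpu, fleet

per_gpu, fleet = summarize({"gpu0": [80, 90], "gpu1": [40, 50]})
print(per_gpu, fleet)  # {'gpu0': 85.0, 'gpu1': 45.0} 65.0
```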
3. Future outlook
Future work includes building a unified MLOps experience, handling mixed online/offline workloads, adapting to domestic GPU chips, standardizing platform interfaces across deployment models (public cloud, private cloud, on‑premise), and improving model‑centric workflows.
Q&A
Q1: How does scaling down workers across machines work? – When a worker is lost, EasyScale backs up its state and resumes it as an additional EasyScale thread inside an existing pod, so one GPU time‑multiplexes two logical workers and training continues with the same effective configuration.
Q2: Will GPU virtualization be open‑sourced? – Companies typically keep core virtualization techniques proprietary for competitive reasons.
Thank you for attending.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.