
Alibaba Cloud Native Deep Learning Platform PAI‑DLC: Architecture, Features, and Future Outlook

This article introduces Alibaba Cloud's PAI‑DLC, a cloud‑native deep learning platform. It covers the platform's architecture and its key features (containerized services, AI‑aware scheduling, GPU virtualization, EasyScale elastic training, data access, and observability), and closes with future directions.


Introduction – With the rapid development of deep learning, building robust AI infrastructure is essential for industrial intelligence, making deep learning platforms a foundational technology. This talk shares the practice and implementation of Alibaba Cloud's cloud‑native deep learning platform PAI‑DLC.

Agenda – The presentation covers three parts: (1) Machine‑learning platform overview, (2) DLC architecture design, and (3) Future outlook.

1. Machine‑learning platform overview

The platform provides four core capabilities: data processing, model development, model training, and model deployment.

Data processing includes data preprocessing and feature engineering.

Model development supports traditional algorithms (e.g., XGBoost, SVM) and deep‑learning frameworks, with interactive tools such as Jupyter Notebook/WebIDE.

Model training leverages frameworks like PyTorch, TensorFlow, PaddlePaddle, and various algorithm engines, handling data storage, caching, and heterogeneous hardware (CPU, GPU, DCU).

Model deployment offers serving frameworks (e.g., ONNX Runtime, TF‑Serving) and custom inference engines, often accelerated by heterogeneous hardware.

2. DLC architecture design

The PAI platform follows a PaaS model built on IaaS resources (CPU, GPU, FPGA, NPU) and Alibaba Cloud Container Service (ACK). It integrates open‑source frameworks and Alibaba‑specific Easy series (EasyNLP, EasyRec, EasyCV, etc.), providing data labeling, visual modeling, and interactive modeling, with PAI‑DLC as the core training engine.

Deep learning platform characteristics – Efficiency, user and resource management, cost saving, elasticity, reproducibility, heterogeneous compute, data autonomy, and AutoML.

PAI‑DLC architecture

The stack consists of a hardware‑infrastructure layer (CPU, GPU/DCU, FPGA, RDMA, NAS/OSS), a Kubernetes layer with custom plugins, a PAI control plane built on K8s CRDs (including KubeDL Operator for TF, PyTorch, MPI, XGBoost jobs), DLC services (authentication, OpenAPI, CLI/SDK), and higher‑level SaaS applications.
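The control plane expresses training jobs as Kubernetes custom resources managed by the KubeDL operator. As a rough illustration of what such a manifest looks like, the sketch below builds a PyTorchJob-style resource as a Python dict; the exact API group, version, and field names here are assumptions modeled loosely on KubeDL-style operators, not DLC's real schema:

```python
def make_pytorch_job(name, workers, image, gpus_per_worker):
    """Build a KubeDL-style PyTorchJob manifest (schema is illustrative)."""
    def replica(n):
        return {
            "replicas": n,
            "template": {"spec": {"containers": [{
                "name": "pytorch",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
            }]}},
        }
    return {
        "apiVersion": "training.kubedl.io/v1alpha1",  # assumed group/version
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica(1),
            "Worker": replica(workers),
        }},
    }

job = make_pytorch_job("demo-train", workers=4, image="pytorch:2.1", gpus_per_worker=1)
print(job["spec"]["pytorchReplicaSpecs"]["Worker"]["replicas"])  # 4
```

An operator watching this resource would reconcile it into one master pod and four worker pods, which is the abstraction the DLC services and OpenAPI layer build on.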

Key capabilities

Containerization – Both service and engine containers are provided; CI/CD pipelines are built with Argo for image building and deployment.

Open API – Unified resource and permission abstraction, enabling SDK generation and third‑party integration.

AI workload scheduling – Uses custom schedulers (e.g., Volcano, Scheduling Framework) and coscheduling to avoid deadlocks and improve resource utilization.
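Coscheduling (gang scheduling) admits a distributed job only if every one of its pods can be placed at once, which prevents the deadlock where two jobs each hold part of the cluster and neither can start. A minimal sketch of the admission check, with illustrative names rather than any real scheduler's API:

```python
def can_gang_schedule(job_pods, node_free_gpus):
    """Return a placement {pod: node} only if ALL pods fit; else None."""
    free = dict(node_free_gpus)          # work on a copy; commit atomically
    placement = {}
    for pod, need in sorted(job_pods.items(), key=lambda kv: -kv[1]):
        node = next((n for n, g in free.items() if g >= need), None)
        if node is None:
            return None                  # a partial placement would deadlock
        free[node] -= need
        placement[pod] = node
    return placement

# Two 2-GPU workers fit on a cluster with 2 + 2 free GPUs.
print(can_gang_schedule({"w0": 2, "w1": 2}, {"nodeA": 2, "nodeB": 2}))
# A 3-worker job is rejected outright instead of grabbing two nodes and stalling.
print(can_gang_schedule({"w0": 2, "w1": 2, "w2": 2}, {"nodeA": 2, "nodeB": 2}))
```

The key property is all-or-nothing admission: resources are only committed once the whole gang fits.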

GPU virtualization & sharing – Alibaba‑developed solution offers fine‑grained memory and compute isolation (down to 1 % or 0.1 GPU), supports multiple GPU models, and enables shared scheduling with pod‑level quota accounting.
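The pod-level quota accounting behind GPU sharing can be sketched as a simple ledger. The class below is a toy model (names and structure are my own, not Alibaba's implementation) that tracks fractional allocations at 1 % granularity using integers, so 0.1 GPU is 10 units and no float rounding drift accumulates:

```python
class GpuShareLedger:
    """Pod-level accounting for fractional GPU sharing (illustrative).

    Requests are expressed in hundredths of a GPU (1 % granularity),
    so 0.1 GPU == 10 units; integer math avoids float rounding drift.
    """
    UNITS_PER_GPU = 100

    def __init__(self, num_gpus):
        self.free = [self.UNITS_PER_GPU] * num_gpus
        self.pods = {}  # pod name -> (gpu index, units held)

    def allocate(self, pod, fraction):
        units = round(fraction * self.UNITS_PER_GPU)
        for i, avail in enumerate(self.free):
            if avail >= units:
                self.free[i] -= units
                self.pods[pod] = (i, units)
                return i
        raise RuntimeError("no GPU has enough free share")

    def release(self, pod):
        i, units = self.pods.pop(pod)
        self.free[i] += units

ledger = GpuShareLedger(num_gpus=2)
print(ledger.allocate("train-a", 0.5))   # fits on GPU 0
print(ledger.allocate("train-b", 0.6))   # only 0.5 free on GPU 0, spills to GPU 1
```

In the real platform the compute and memory limits are also enforced at runtime by the virtualization layer; this sketch covers only the scheduling-side bookkeeping.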

EasyScale elastic training – A PyTorch‑based framework that preserves hyper‑parameters during elastic scaling by using “EasyScale threads” to multiplex GPU time while maintaining accuracy, also supporting heterogeneous GPU clusters.
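The hyper-parameter-preserving trick is to keep the number of logical workers constant: the global batch size and data sharding never change, and when physical GPUs come and go, each GPU simply time-multiplexes more or fewer "EasyScale threads". A toy sketch of that mapping, under the assumption of simple round-robin assignment:

```python
def assign_easyscale_threads(num_logical, physical_gpus):
    """Map a fixed set of logical workers onto the GPUs currently available.

    The logical-worker count (hence global batch size and data sharding)
    never changes; only the multiplexing degree per GPU does.
    """
    mapping = {g: [] for g in physical_gpus}
    for w in range(num_logical):
        gpu = physical_gpus[w % len(physical_gpus)]
        mapping[gpu].append(w)
    return mapping

# 8 logical workers on 8 GPUs: one EasyScale thread each.
full = assign_easyscale_threads(8, [f"gpu{i}" for i in range(8)])
# After losing half the GPUs, each survivor runs 2 threads, so gradients
# are still accumulated over the same 8 logical micro-batches per step.
shrunk = assign_easyscale_threads(8, [f"gpu{i}" for i in range(4)])
print(shrunk["gpu0"])   # [0, 4]
```

Because every logical worker still processes its own shard, the training trajectory stays consistent with the non-elastic run, at the cost of longer step time on the shrunken cluster.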

Data access – Supports OSS, NAS, CPFS, local and PVC mounts, data caching with Fluid, data‑affinity scheduling, and strict data isolation for enterprise scenarios.
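Data-affinity scheduling means preferring nodes where the dataset is already cached (e.g., by a Fluid-managed cache) so training pods read locally instead of over the network. A hypothetical node-scoring function, with made-up data structures for illustration:

```python
def rank_nodes_by_data_affinity(nodes, cache_map, dataset):
    """Sort candidate nodes best-first by how much of the dataset they cache.

    cache_map: node -> {dataset: cached fraction in [0, 1]}.
    Nodes with no cache entry score 0; ties keep the input order.
    """
    def score(node):
        return cache_map.get(node, {}).get(dataset, 0.0)
    return sorted(nodes, key=score, reverse=True)

cache = {
    "node1": {"imagenet": 0.9},
    "node2": {"imagenet": 0.2},
}
print(rank_nodes_by_data_affinity(["node0", "node1", "node2"], cache, "imagenet"))
# node1 (90 % cached) ranks first; node0 (nothing cached) last
```

In Kubernetes terms this would plug in as a scheduler scoring plugin; the enterprise isolation requirement is orthogonal and handled by mount-level access control.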

Observability – Integrates Alibaba SLS and open‑source ELK for log/event collection, exporters (Node, GPU, RDMA) for metrics, and visual dashboards (including TensorBoard) for training monitoring.

3. Future outlook

Future work includes building a unified MLOps experience, handling mixed online/offline workloads, adapting to domestic GPU chips, standardizing platform interfaces across deployment models (public cloud, private cloud, on‑premise), and improving model‑centric workflows.

Q&A

Q1: How does scaling down work when a worker on another machine is lost? – EasyScale backs up the lost worker's state and resumes its logical worker as an additional EasyScale thread in a surviving pod, so both logical workers continue training inside a single pod with unchanged hyper‑parameters.

Q2: Will GPU virtualization be open‑sourced? – Companies typically keep core virtualization techniques proprietary for competitive reasons.

Thank you for attending.

Tags: cloud-native, deep learning, Kubernetes, AI platform, GPU virtualization, elastic training
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
