
Design and Resource Scheduling of Cloud‑Native AI and the PaddleFlow Workflow Engine

The article explains Baidu’s cloud‑native AI resource scheduling across single‑ and multi‑GPU nodes, describes the Kubernetes‑based PaddleFlow workflow engine with its hierarchical queues, advanced scheduling algorithms, and unified storage, and shows how these technologies improve GPU utilization, reduce fragmentation, and simplify AI task orchestration.

Baidu Geek Talk

This article is a written version of the first session of Baidu Baige "Cloud‑Native AI" technical open class held in December. It explains resource scheduling and management methods for cloud‑native AI in single‑node single‑GPU, single‑node multi‑GPU, and multi‑node multi‑GPU scenarios, and introduces the architecture and product details of the AI workflow engine PaddleFlow, which helps AI engineers hide underlying resource complexity and seamlessly connect AI tasks with AI resources.

The presentation is divided into three parts: (1) an introduction to cloud‑native AI; (2) resource management and scheduling under cloud‑native AI; (3) an overview of Baidu’s self‑developed AI workflow engine PaddleFlow.

Cloud‑native refers to a set of technologies for building elastic, scalable applications that can run on public, private, or hybrid clouds. Cloud‑native AI applies these principles to AI workloads, integrating container services, elastic scaling, and the full AI lifecycle (data, training, inference, etc.).

Common pain points identified from more than ten enterprises include resource inefficiency (low utilization, heterogeneous chip scheduling, container networking), engineering inefficiency (large model deployment, training/inference speed, image startup time), resource fragmentation, and low GPU utilization. Cloud‑native technologies can significantly alleviate these issues.

The resource management layer of Baidu’s cloud‑native AI provides heterogeneous chip management, high‑performance RDMA networking, and high‑performance storage access. GPU virtualization includes dual‑engine GPU containers, remoteGPU, and Kunlun chip virtualization.

The AI scheduling layer implements a pipeline of actions: enqueueing, resource allocation, resource reclamation, resource preemption, and backfill. It maintains a global resource view and per‑tenant quotas to select optimal nodes for AI tasks.
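The allocation step of such a cycle can be sketched as a simple loop. This is a hypothetical illustration, not PaddleFlow’s actual code; `Task`, `Queue`, and `schedule_cycle` are assumed names:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    gpus: int

@dataclass
class Queue:
    quota: int                 # GPUs this tenant is entitled to
    used: int = 0
    pending: list = field(default_factory=list)

def schedule_cycle(queues: dict, free_gpus: int) -> list:
    """One allocation pass: place each pending task that fits both
    its tenant's quota and the cluster's remaining free capacity."""
    placed = []
    for q in queues.values():
        still_pending = []
        for task in q.pending:
            if q.used + task.gpus <= q.quota and task.gpus <= free_gpus:
                q.used += task.gpus
                free_gpus -= task.gpus
                placed.append(task.name)
            else:
                still_pending.append(task)   # waits for reclaim/preemption
        q.pending = still_pending
    return placed
```

Reclamation and preemption would run as further passes over the same state, releasing GPUs back into `free_gpus` before the next allocation pass.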

Key concepts include the resource queue, which supports both over‑commit (allowing tasks to exceed quota with tags) and non‑over‑commit modes, and multi‑tenant quota management.
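The difference between the two queue modes comes down to the admission check. The sketch below is an assumption about how such a check could look (the function name and parameters are illustrative, not PaddleFlow’s API):

```python
def admit(requested: int, used: int, quota: int,
          over_commit: bool, idle_cluster_gpus: int) -> bool:
    """Decide whether a task may enter the queue.

    Non-over-commit mode: the tenant's hard quota is the ceiling.
    Over-commit mode: tasks beyond quota are still admitted (and would be
    tagged as reclaimable) as long as the cluster has idle capacity.
    """
    if used + requested <= quota:
        return True
    return over_commit and requested <= idle_cluster_gpus
```

Tagging over-quota tasks is what lets the scheduler reclaim their GPUs first when a tenant that is under quota needs capacity back.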

Several scheduling algorithms are described: Gang scheduling (ensuring all Pods of a job are scheduled together or not at all), Gang preemption, Binpack (packing tasks onto fewer nodes to reduce resource fragmentation), GPU online/offline mixed scheduling (co‑locating latency‑sensitive online and throughput‑oriented offline workloads), topology‑aware scheduling (considering NVLink and NIC–GPU topology for multi‑GPU training), and ToR‑aware scheduling (placing Pods of the same training job under the same top‑of‑rack switch to reduce cross‑switch network congestion).
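Of these, Binpack is the easiest to illustrate: score each node by how full it would be after placement, and prefer the fullest, so that free GPUs concentrate on fewer nodes and whole nodes stay available for large jobs. A minimal sketch, with assumed names:

```python
def binpack_score(node_free: int, node_total: int, requested: int) -> float:
    """Score a node for binpack placement: higher means fuller after
    placing the task. Returns -1.0 if the task does not fit."""
    if requested > node_free:
        return -1.0
    used_after = node_total - node_free + requested
    return used_after / node_total

def pick_node(nodes: dict, requested: int):
    """nodes: name -> (free_gpus, total_gpus). Return the best-scoring
    node name, or None if no node can fit the request."""
    scored = {n: binpack_score(f, t, requested) for n, (f, t) in nodes.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] >= 0 else None
```

With an empty 8‑GPU node and a node that has 2 of 8 GPUs free, a 2‑GPU task lands on the nearly full node, leaving the empty node intact for an 8‑GPU job — the opposite of a spread (anti‑fragmentation vs. load‑balancing) policy.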

PaddleFlow is an AI workflow engine built on Kubernetes, offering cloud‑native characteristics, unified compute and storage access, and support for major distributed training frameworks. It consists of two main parts: an AI job scheduling system and a resource‑management system that leverages the previously described scheduling algorithms.

Core modules include Pipeline Core, which provides a Python DSL and YAML for defining workflows, checkpoint‑resume capabilities, automatic caching of intermediate results, and advanced DAG features such as sub‑DAGs. A simple example workflow defines three tasks: preprocess → train → validate.
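PaddleFlow’s actual DSL is not reproduced here, but the three‑step example (preprocess → train → validate) boils down to a dependency DAG that the engine resolves into an execution order. A generic sketch of that idea, with hypothetical `Step`/`topo_order` names:

```python
class Step:
    """One pipeline step: a name, a command to run, and upstream deps."""
    def __init__(self, name, command, deps=()):
        self.name, self.command, self.deps = name, command, list(deps)

def topo_order(steps):
    """Resolve execution order by repeatedly emitting steps whose
    dependencies are all satisfied (Kahn-style topological sort)."""
    done, order = set(), []
    remaining = list(steps)
    while remaining:
        ready = [s for s in remaining if all(d in done for d in s.deps)]
        if not ready:
            raise ValueError("cycle detected in pipeline DAG")
        for s in ready:
            order.append(s.name)
            done.add(s.name)
        remaining = [s for s in remaining if s.name not in done]
    return order

pre = Step("preprocess", "python pre.py")
train = Step("train", "python train.py", deps=["preprocess"])
val = Step("validate", "python val.py", deps=["train"])
```

Caching of intermediate results then becomes a per‑step check — if a step’s inputs are unchanged, its recorded output is reused and the step is skipped — which is also what makes checkpoint‑resume cheap.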

Hierarchical queues and hierarchical scheduling enable elastic quota allocation across multiple users, improving overall cluster utilization while guaranteeing fairness.
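The elastic part of such a scheme can be sketched as "guaranteed minimum first, then lend out the slack". This is an illustrative model only (it assumes the guaranteed minimums fit within total capacity; the function name is hypothetical):

```python
def elastic_share(demands: dict, guarantees: dict, capacity: int) -> dict:
    """Give each child queue its guaranteed minimum (capped by demand),
    then hand leftover capacity round-robin to queues with unmet demand."""
    alloc = {q: min(demands[q], guarantees[q]) for q in demands}
    leftover = capacity - sum(alloc.values())
    while leftover > 0:
        hungry = [q for q in demands if alloc[q] < demands[q]]
        if not hungry:
            break
        for q in hungry:
            if leftover == 0:
                break
            alloc[q] += 1
            leftover -= 1
    return alloc
```

A busy queue can thus borrow an idle sibling’s unused guarantee, while fairness is preserved because the guarantee is restored (via reclamation) as soon as the idle queue has demand again.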

PaddleFlowFS provides a unified storage abstraction with a VFS layer, local memory and disk caching, and support for HDFS, S3‑compatible object stores, and other backends. Benchmarks show 5‑10× faster first reads and a >30% improvement on cached reads compared with traditional HDFS FUSE or S3FS clients.
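The caching layer follows a standard read‑through pattern: a first read pulls the object from the backing store into a local cache, and repeat reads are served locally. A minimal sketch of that pattern (class and parameter names are assumptions, not PaddleFlowFS’s API):

```python
import hashlib
import os

class CachedFS:
    """Read-through cache sketch: misses fetch from the backing store
    (HDFS, S3, ...) into a local cache directory; hits read the local copy."""

    def __init__(self, backend_read, cache_dir):
        self.backend_read = backend_read      # callable: path -> bytes
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _local(self, path):
        # Hash the remote path to get a flat, collision-resistant cache key.
        return os.path.join(self.cache_dir,
                            hashlib.sha1(path.encode()).hexdigest())

    def read(self, path):
        local = self._local(path)
        if os.path.exists(local):             # cache hit: no backend round trip
            with open(local, "rb") as f:
                return f.read()
        data = self.backend_read(path)        # cache miss: fetch and persist
        with open(local, "wb") as f:
            f.write(data)
        return data
```

A real implementation additionally needs invalidation (e.g. comparing mtimes or ETags) and cache‑size eviction, which this sketch omits.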

The article concludes by noting that future sessions will cover training and inference acceleration components, inviting readers to stay tuned.

Tags: cloud native, AI, Kubernetes, resource scheduling, workflow engine, PaddleFlow