
ByteDance's Cloud‑Native Transformation of Its Machine Learning Platform

This article explains how ByteDance redesigned its machine‑learning platform using cloud‑native principles, detailing motivations, the shift from Yarn to Kubernetes, the implementation of PS‑Worker and AllReduce frameworks, unified operators, heterogeneous resource scheduling, elastic training, and future directions for large‑scale AI workloads.

DataFunTalk

ByteDance, a company that invests heavily in artificial intelligence, has built a comprehensive product matrix covering data processing, model development, offline training, and online inference to support its fast-growing products such as Douyin and Toutiao.

Facing billions of users and increasingly complex business scenarios, the existing machine‑learning platform struggled with development experience, training efficiency, task orchestration, and resource operations. The company therefore embarked on a cloud‑native transformation of its ML system.

Motivation: The platform needed to improve resource stability, homogeneity, and network topology awareness for both PS‑Worker and AllReduce training frameworks.

PS‑Worker framework (used in promotion‑search scenarios) separates parameter servers (PS) and workers. PS must be high‑priority, homogeneous, and network‑aware, while workers must be elastic and tolerant of failures.

AllReduce framework (used in CV/NLP scenarios) relies on workers that store the full model, requiring homogeneous GPUs, NUMA affinity, and robust fault‑tolerant scheduling.
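A unified job object could declare the two roles separately, with stable, high-priority PS replicas and elastic, failure-tolerant workers. The manifest below is a hypothetical sketch modeled on Kubeflow-style job CRDs; the API group, kind, and field names are illustrative assumptions, not ByteDance's actual schema.

```yaml
apiVersion: training.example.com/v1   # hypothetical group/version
kind: TrainingJob
metadata:
  name: ps-worker-demo
spec:
  replicaSpecs:
    PS:                        # parameter servers: stable, homogeneous
      replicas: 4
      priorityClassName: high-priority
      restartPolicy: Never     # a lost PS fails the job
    Worker:                    # workers: elastic and failure-tolerant
      minReplicas: 8
      maxReplicas: 32
      restartPolicy: OnFailure # a lost worker is simply restarted
```

An AllReduce job would instead declare a single homogeneous Worker role, since every replica holds the full model.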

Cloud‑native foundation: The team adopted CNCF‑defined cloud‑native concepts—Kubernetes, containers, service mesh, immutable infrastructure, and declarative APIs—to address these challenges.

Kubernetes architecture includes the Master (ETCD, API server, scheduler, controller manager) and Kubelet agents that manage pod lifecycles and node resources.

Extensibility is achieved via scheduler plugins, custom operators, CRI/CNI/CSI standards, and device plugins for heterogeneous resources such as GPUs, Habana, and RDMA.
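The core of a scheduler plugin's Filter stage is a feasibility check: reject nodes whose allocatable devices cannot satisfy the pod's extended-resource requests. The sketch below mimics that logic in plain Python under illustrative resource names and quantities; it is not ByteDance's plugin code.

```python
# Minimal sketch of a Filter-stage feasibility check for extended
# resources (e.g. GPUs or RDMA devices exposed by a device plugin).

def filter_nodes(pod_requests: dict, nodes: dict) -> list:
    """Return names of nodes whose allocatable extended resources
    can satisfy every request in the pod spec."""
    feasible = []
    for name, allocatable in nodes.items():
        if all(allocatable.get(res, 0) >= qty
               for res, qty in pod_requests.items()):
            feasible.append(name)
    return feasible

pod = {"nvidia.com/gpu": 2, "rdma/hca": 1}
nodes = {
    "gpu-node-a": {"nvidia.com/gpu": 8, "rdma/hca": 2},
    "gpu-node-b": {"nvidia.com/gpu": 1, "rdma/hca": 1},  # too few GPUs
    "cpu-node-c": {},                                    # no devices at all
}
print(filter_nodes(pod, nodes))  # -> ['gpu-node-a']
```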

Machine‑learning platform overview: A one‑stop platform provides experiment environments, sandbox debugging, feature engineering, offline training, and online inference. Unified operators support both online deployment (native Deployments) and offline training (custom TrainingJob Operator).

Online inference optimizations include GPU sharing down to 0.1‑card granularity, multi‑dimensional GPU scheduling, MPS containerization, NUMA‑aware placement, and detailed monitoring via DCGM and NVML, as well as an internal nvidia‑smi replacement for container diagnostics.
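Sharing a GPU at 0.1-card granularity amounts to tracking each card's capacity in tenths and bin-packing fractional requests onto physical cards. The first-fit policy and numbers below are illustrative assumptions, not the platform's actual allocator.

```python
# Sketch of 0.1-card GPU sharing: each card's free capacity is counted
# in tenths so several inference pods can share one physical GPU.

def place(request_tenths: int, gpus: list) -> int:
    """First-fit: return the index of the first GPU with enough free
    tenths, decrementing its capacity; -1 if no card can fit it."""
    for i, free in enumerate(gpus):
        if free >= request_tenths:
            gpus[i] -= request_tenths
            return i
    return -1

gpus = [10, 10]        # two cards, each with 1.0 card (10 tenths) free
print(place(3, gpus))  # 0.3 card -> GPU 0
print(place(8, gpus))  # 0.8 card no longer fits on GPU 0 -> GPU 1
print(place(7, gpus))  # 0.7 card fits back onto GPU 0
print(gpus)            # remaining capacity in tenths: [0, 2]
```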

Offline training is powered by a unified TrainingJob Operator that supports Lagrange, TensorFlow, and PyTorch, enabling dynamic learning‑rate adjustments, elastic scaling, and integration with HPA.

Unified scheduling across tens of thousands of K8s nodes and a 50k‑node Yarn cluster is realized with a custom distributed scheduler consisting of Dispatcher, Scheduler, and Binder, supporting DRF, gang scheduling, priority preemption, and fair‑share.
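Under DRF, each job's "dominant share" is its largest fractional use of any cluster resource, and the scheduler allocates next to the job with the smallest dominant share. A minimal sketch, with made-up cluster sizes and job demands:

```python
# Sketch of the Dominant Resource Fairness (DRF) selection rule.

CLUSTER = {"cpu": 900, "gpu": 90}  # illustrative cluster capacity

def dominant_share(usage: dict) -> float:
    """A job's dominant share: its max fractional use of any resource."""
    return max(usage[r] / CLUSTER[r] for r in usage)

jobs = {
    "ps-worker-job": {"cpu": 300, "gpu": 9},   # dominant: cpu, 300/900 = 1/3
    "allreduce-job": {"cpu": 90,  "gpu": 36},  # dominant: gpu, 36/90 = 0.4
}
# The job with the smallest dominant share is served next.
next_job = min(jobs, key=lambda j: dominant_share(jobs[j]))
print(next_job)  # -> ps-worker-job (1/3 < 0.4)
```

Gang scheduling layers on top of this: a PS‑Worker job's pods are admitted all-or-nothing so a partially placed job never deadlocks the cluster.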

Heterogeneous micro‑topology scheduling uses a custom CNR CRD to store node topology, a scheduler plugin to bind pods respecting CPU, memory, GPU, and network affinities, and a NumaAffinity DevicePlugin for final placement.
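The placement decision reduces to preferring a NUMA node that can satisfy the pod's CPU and GPU request on the same socket, using the free-resource view a NumaAffinity-style device plugin would report. The topology data below is an illustrative assumption:

```python
# Sketch of NUMA-aware placement: co-locate a pod's CPUs and GPUs
# on one socket to avoid cross-NUMA memory and PCIe traffic.

def pick_numa(cpu_req: int, gpu_req: int, topology: dict):
    """Return the first NUMA node with enough free CPUs and GPUs,
    or None if no single socket can satisfy the request."""
    for node, free in topology.items():
        if free["cpu"] >= cpu_req and free["gpu"] >= gpu_req:
            return node
    return None

topology = {
    "numa0": {"cpu": 4,  "gpu": 0},  # CPUs free, but its GPUs are taken
    "numa1": {"cpu": 16, "gpu": 2},  # both resources free on one socket
}
print(pick_numa(8, 1, topology))  # -> numa1
```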

Elastic training across clusters leverages Virtual‑Kubelet to treat remote clusters as virtual nodes, allowing online resources to be borrowed for offline jobs, with an autoscaler and TrainingJob Operator handling replica adjustments.
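The autoscaler's core decision is simple: grow a job's worker count when the virtual node backed by an online cluster has idle slots, shrink it when that capacity is reclaimed, always within the job's declared bounds. A minimal sketch with illustrative numbers:

```python
# Sketch of the elastic-training replica decision: clamp the current
# worker count plus borrowed (or reclaimed) slots into [min, max].

def target_workers(current: int, min_w: int, max_w: int,
                   idle_slots: int) -> int:
    """idle_slots > 0 means spare online capacity to borrow;
    idle_slots < 0 means the online cluster is taking capacity back."""
    return max(min_w, min(max_w, current + idle_slots))

print(target_workers(current=4, min_w=2, max_w=16, idle_slots=8))    # -> 12
print(target_workers(current=12, min_w=2, max_w=16, idle_slots=-10)) # -> 2
```

The TrainingJob Operator then applies the new replica count, relying on the framework's tolerance of worker churn so training continues through scale events.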

Future outlook includes strengthening cluster federation, adopting Argo for DAG‑based ML pipelines, integrating Alluxio for data caching, and exploring deeper resource pooling between CPU and GPU across clusters.

The presentation concludes with a summary of ByteDance’s cloud‑native practice, emphasizing efficiency gains through standardization and cost reductions via fine‑grained elasticity and mixed‑resource utilization.

Tags: cloud native, Kubernetes, resource scheduling, machine learning, elastic training, heterogeneous compute
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
