How ByteDance Scaled Cloud‑Native Infrastructure: Lessons in Multi‑Cluster Scheduling

This account of ByteDance’s cloud‑native transformation covers its layered technical system, its multi‑year Kubernetes‑based evolution, unified multi‑cluster resource management, and hierarchical scheduling. It shows how the company sustains high development speed and resource efficiency while preparing for next‑generation serverless infrastructure.

Volcano Engine Developer Services

ByteDance Cloud‑Native Journey

Viewed as a layered technical system, ByteDance builds front‑end products such as Toutiao, Douyin, and Xigua on top of shared technology middle platforms and infrastructure services. This platform evolves continuously to support rapid business growth.

ByteDance currently runs over 100,000 online services, more than one million Pods, and updates its business systems roughly every five days. Daily offline tasks exceed 20,000, processing dozens of exabytes of storage.

Cloud‑Native Evolution Timeline

2016 – Launch of Toutiao Cloud Engine (TCE) based on Kubernetes for fast application deployment.

2018 – Micro‑service architecture upgrade; core business migrated to micro‑services, building service framework, mesh, monitoring, and alerting on TCE.

2019 – "Promotion Search" cloud‑native integration, fully containerizing physical‑machine services.

2020 – Offline scheduling integration and storage cloud‑native transformation, simplifying supply‑chain selection and improving operational efficiency.

2021 – Federated multi‑cluster evolution, achieving standardized, unified application orchestration across multi‑cloud environments.

Current infrastructure focus: unified management and scheduling of federated multi‑cluster resources.

Motivation for Cloud‑Native Development

Development efficiency: the cloud‑native resource model reduces operational overhead, allowing teams to focus on core business logic and accelerate iteration.

Resource efficiency: large‑scale resource pooling and flexible scheduling lower overall resource costs.

Technology Generations

DevOps: Emphasizes automation of management and operations, typically using VMs and tools like Jenkins for monolithic application deployment.

Cloud Native: Micro‑service‑centric, using containers and Kubernetes for flexible deployment and improved developer productivity.

Serverless: Developers write functions or minimal micro‑services; the underlying platform handles capacity, routing, and governance, further boosting development and production efficiency.

Product Forward Integration & Resource Scale

Two paths drive this evolution. Product forward integration standardizes and abstracts common business logic, data models, and resource management into the infrastructure. Resource scale optimizes large resource pools through pooling, co‑location of mixed workloads, and advanced scheduling.

Future Directions

Infrastructure without management.

Automatic scaling and elasticity.

Improved development efficiency.

Improved resource efficiency.

Pay‑as‑you‑go cost savings.

Resource Management Practice

With the cloud‑native transformation unified across the company, ByteDance faces the challenge of efficiently managing and operating the group’s resources. In the ideal model, developers use a single entry point to obtain resources from a unified pool, with on‑demand, flexible access similar to “tap water”.

The unified resource pool spans multiple regions and compute architectures, aiming for globally optimal efficiency and flexible allocation across business lines.

Challenges include handling diverse workload types, ensuring isolation for strong‑security requirements, and maintaining performance, safety, and cost efficiency.

Unified Resource Management Challenges & Benefits

Challenges

Co‑managing heterogeneous workloads (e.g., long‑running services and batch jobs) in the same queue or cluster.

Balancing strong isolation with low‑cost resource sharing within the same nodes.

Addressing performance, security, and pricing impacts of unified management.

Benefits

Transparent resource cost visibility across business lines, facilitating resource‑operating analysis.

Increased resource delivery elasticity, enabling easy cross‑business coordination and pool optimization.

Solution Approaches

Abstract resource‑selling models to let business lines express precise needs.

Build a unified quota management platform for flexible developer control.

Implement a hierarchical resource scheduling system for rapid, flexible delivery.
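
The unified quota platform mentioned in the list above can be pictured as a ledger over one shared pool. The following is only a minimal sketch under stated assumptions: `QuotaLedger`, its methods, and the business-line names are hypothetical and do not describe ByteDance's actual platform.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaLedger:
    """Hypothetical per-business-line quota ledger over one shared pool."""
    pool_cores: int
    limits: dict[str, int] = field(default_factory=dict)  # business line -> quota
    used: dict[str, int] = field(default_factory=dict)    # business line -> cores in use

    def set_quota(self, line: str, cores: int) -> None:
        # The sum of all quotas may not exceed the pool itself.
        others = sum(v for k, v in self.limits.items() if k != line)
        if others + cores > self.pool_cores:
            raise ValueError("quota exceeds pool capacity")
        self.limits[line] = cores

    def request(self, line: str, cores: int) -> bool:
        """Admit a request only if the business line stays within its quota."""
        in_use = self.used.get(line, 0)
        if in_use + cores > self.limits.get(line, 0):
            return False
        self.used[line] = in_use + cores
        return True

    def transfer(self, src: str, dst: str, cores: int) -> None:
        """Move unused quota from one business line to another."""
        free = self.limits.get(src, 0) - self.used.get(src, 0)
        if src not in self.limits or cores > free:
            raise ValueError("cannot transfer quota that is missing or in use")
        self.limits[src] -= cores
        self.limits[dst] = self.limits.get(dst, 0) + cores
```

Keeping quota and usage in one place is what makes cost visible per business line and lets unused quota move between lines without touching the underlying machines.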

Resource Model Abstraction

CPU resources are categorized into three levels:

Dedicated Core: Exclusive physical cores, optionally with NUMA topology awareness.

Shared Core: Pods share CPU pools, providing finer‑grained isolation.

Reclaimed Core: Shared cores with low‑priority reclamation for elastic capacity.
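
To make the three tiers concrete, here is a toy admission model for a single node. It is a sketch under simplifying assumptions: cores are counted additively (real shared pools oversubscribe), eviction order is arbitrary, and `CpuTier`, `Node`, and `admit` are hypothetical names rather than anything from ByteDance's stack.

```python
from dataclasses import dataclass, field
from enum import Enum


class CpuTier(Enum):
    DEDICATED = "dedicated"  # exclusive physical cores
    SHARED = "shared"        # pods share a common CPU pool
    RECLAIMED = "reclaimed"  # low priority, may be evicted


@dataclass
class Node:
    total_cores: int
    # One bucket per tier, mapping pod name -> cores held.
    pods: dict[CpuTier, dict[str, int]] = field(
        default_factory=lambda: {tier: {} for tier in CpuTier}
    )

    def used(self) -> int:
        return sum(sum(bucket.values()) for bucket in self.pods.values())

    def admit(self, pod: str, cores: int, tier: CpuTier) -> list[str]:
        """Admit a pod and return the reclaimed pods evicted to make room.

        Raises RuntimeError if the pod cannot fit.
        """
        free = self.total_cores - self.used()
        evicted: list[str] = []
        if cores > free:
            if tier is CpuTier.RECLAIMED:
                # Reclaimed pods only ever use spare capacity.
                raise RuntimeError("no spare capacity for reclaimed pod")
            reclaimed = self.pods[CpuTier.RECLAIMED]
            if cores > free + sum(reclaimed.values()):
                raise RuntimeError("node is full")
            # Higher tiers take capacity back from reclaimed pods.
            for name in list(reclaimed):
                if cores <= free:
                    break
                free += reclaimed.pop(name)
                evicted.append(name)
        self.pods[tier][pod] = cores
        return evicted
```

The point of the reclaimed tier shows up in `admit`: batch work can soak up idle cores, but it is the first thing to go when a dedicated or shared pod needs the capacity.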

Elastic Delivery Models

OnDemand: The ideal “as‑you‑need” model, currently used at small scale for function workloads.

Reserved: Guarantees resource quantity for stable services.

Spot: Bidding‑based allocation of surplus resources, supporting various core tiers.
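
The Spot model can be illustrated with a small allocation routine. This is a hedged sketch, not ByteDance's pricing mechanism: the article only says Spot is bidding-based, so the highest-bid-first rule, the all-or-nothing grants, and the names `SpotBid` and `allocate_spot` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotBid:
    tenant: str
    cores: int
    price: float  # hypothetical bid per core-hour


def allocate_spot(
    capacity: int, reserved: int, on_demand: int, bids: list[SpotBid]
) -> dict[str, int]:
    """Grant surplus cores to the highest bidders first.

    Surplus is whatever Reserved and OnDemand usage leave unclaimed.
    Bids are all-or-nothing: a bid that does not fit is skipped.
    """
    surplus = max(capacity - reserved - on_demand, 0)
    grants: dict[str, int] = {}
    for bid in sorted(bids, key=lambda b: b.price, reverse=True):
        if bid.cores <= surplus:
            grants[bid.tenant] = grants.get(bid.tenant, 0) + bid.cores
            surplus -= bid.cores
    return grants
```

With 100 cores, 60 reserved, and 10 on demand, 30 cores are surplus; a 20-core bid at the top price is granted, a second 20-core bid no longer fits, and a smaller, cheaper bid can still fill the remainder.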

Hierarchical Scheduling

Effective resource delivery requires a complete scheduling system composed of three layers:

Node‑level Scheduler: Extends Kubernetes with micro‑topology awareness and a QoS Resource Manager to allocate CPU, memory, GPU, and network devices per pod.

Cluster‑level Scheduler: A central dispatcher, parallel schedulers, and a binder handle predicate and priority calculations, supporting both long‑running and batch workloads.

Global Scheduler: A federated layer that enables million‑node‑scale scheduling across data centers, balancing application priority and regional resource pools.
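
The division of labor between the three layers can be sketched in a few functions. This is an illustration only, under simple assumptions: the global layer picks the emptiest cluster in the requested region, the cluster layer filters (predicate) and then scores for the tightest fit (priority), and the node layer prefers a single NUMA node. All class names, function names, and region strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    numa_free: list[int]  # free cores per NUMA node

    @property
    def free_cores(self) -> int:
        return sum(self.numa_free)


@dataclass
class Cluster:
    name: str
    region: str
    nodes: list[Node]


@dataclass(frozen=True)
class Pod:
    name: str
    cores: int
    region: str


def pick_cluster(pod: Pod, clusters: list[Cluster]) -> Optional[Cluster]:
    """Global layer: the cluster in the pod's region with the most free cores."""
    candidates = [c for c in clusters if c.region == pod.region]
    if not candidates:
        return None
    return max(candidates, key=lambda c: sum(n.free_cores for n in c.nodes))


def pick_node(pod: Pod, cluster: Cluster) -> Optional[Node]:
    """Cluster layer: predicate (enough cores), then priority (tightest fit)."""
    feasible = [n for n in cluster.nodes if n.free_cores >= pod.cores]
    if not feasible:
        return None
    return min(feasible, key=lambda n: n.free_cores - pod.cores)


def bind(pod: Pod, node: Node) -> list[int]:
    """Node layer: use one NUMA node if possible, otherwise spread."""
    for i, free in enumerate(node.numa_free):
        if free >= pod.cores:
            node.numa_free[i] -= pod.cores
            return [i]
    used, remaining = [], pod.cores
    for i, free in enumerate(node.numa_free):
        take = min(free, remaining)
        if take:
            node.numa_free[i] -= take
            used.append(i)
            remaining -= take
    return used


def schedule(pod: Pod, clusters: list[Cluster]):
    """Run the three layers in order; return (cluster, node, NUMA ids) or None."""
    cluster = pick_cluster(pod, clusters)
    node = pick_node(pod, cluster) if cluster else None
    if node is None:
        return None
    return cluster.name, node.name, bind(pod, node)
```

Each layer only needs the information available at its own level: the global layer never looks at NUMA topology, and the node layer never reasons about regions, which is what lets the layers scale independently.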

ByteDance’s global scheduler builds on KubeFed V2, adding transparent federation semantics, unified multi‑region disaster recovery, and standardized compute capabilities.
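
For readers unfamiliar with KubeFed V2, a federated resource wraps an ordinary Kubernetes object in three parts: a template, a placement, and per-cluster overrides. The sketch below builds such a manifest as a Python dictionary. It shows only the upstream KubeFed `FederatedDeployment` shape, not ByteDance's extensions, and the helper name, application name, image, and cluster names are hypothetical.

```python
import json
from typing import Optional


def federated_deployment(
    name: str,
    image: str,
    replicas: int,
    clusters: list[str],
    replica_overrides: Optional[dict[str, int]] = None,
    namespace: str = "default",
) -> dict:
    """Build a KubeFed v2 FederatedDeployment manifest as a plain dict."""
    template = {
        "metadata": {"labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }
    # Overrides patch the template for one member cluster at a time.
    overrides = [
        {
            "clusterName": cluster,
            "clusterOverrides": [{"path": "/spec/replicas", "value": count}],
        }
        for cluster, count in (replica_overrides or {}).items()
    ]
    return {
        "apiVersion": "types.kubefed.io/v1beta1",
        "kind": "FederatedDeployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "template": template,
            # Placement lists the member clusters that receive the object.
            "placement": {"clusters": [{"name": c} for c in clusters]},
            "overrides": overrides,
        },
    }


if __name__ == "__main__":
    manifest = federated_deployment(
        "demo-app", "nginx", 3, ["cluster-a", "cluster-b"], {"cluster-b": 5}
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied to a KubeFed host cluster like any other manifest; the control plane then propagates the templated Deployment to each placed cluster, applying the replica override where one is given.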

Microservice Governance

ByteDance operates over 100,000 microservices with complex dependencies. A unified application‑centric solution is needed to manage services within a sub‑business line, reduce network encryption overhead, and simplify architecture evolution.

Resource Utilization vs. Effective Utilization

Average cluster utilization exceeds 40%, but a gap remains between raw utilization and actual business cost. Future work focuses on addressing application‑level capacity costs directly.

Third‑Generation Infrastructure Product Iteration

Simplify developer experience, especially for function‑based logic and capacity management.

Shift from single‑service management to distributed‑application management platforms.

Provide finer‑grained isolation units with advanced code distribution, resource packaging, and runtime control.

These practices are encapsulated in the open‑source KubeWharf project, through which ByteDance contributes them back to the community.

Volcano Engine, ByteDance’s cloud service platform, abstracts these cloud‑native practices into a full suite of products, including container services, image repositories, service mesh, observability, resilience, and intelligent multi‑layer scheduling.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
