How ByteDance Scales Cloud‑Native Infrastructure with Hybrid K8s Scheduling

ByteDance’s cloud‑native ecosystem combines a multi‑layered architecture, dynamic resource over‑provisioning control, hybrid online‑offline scheduling, and federated cluster management to boost container utilization from 23% to 63%, reduce costs by 40%, and support massive events like the 2021 Spring Festival Gala.

Volcano Engine Developer Services

Jul 12, 2022

ByteDance Cloud‑Native Architecture

ByteDance’s internal cloud‑native technology spans the entire organization, covering research‑development pipelines, service platforms, infrastructure, SRE, and cloud‑native security. The system has evolved through several stages to support massive online services and large‑scale offline workloads.

Architecture Layers

R&D System Layer: CI/CD pipelines, observability platform, efficiency platform, chaos engineering platform.

Service Platform Layer: Cloud‑native framework, service mesh, serverless computing, edge computing.

Infrastructure Layer: Container management platform, compute‑storage‑network PaaS.

SRE System: Connects R&D processes with infrastructure management.

Cloud‑Native Security: Business, identity, and network security capabilities.

Evolution Timeline

2015‑2017: Launch of TCE platform for container lifecycle management.

2018: Prototype Service Mesh and unified PaaS for compute and storage.

2019: Full rollout of Service Mesh, cluster federation, and unified monitoring.

2020: Edge‑computing enhancements, second‑generation cluster federation, resource‑level monitoring with eBPF.

2021: Cloud products opened to external customers via Volcano Engine public cloud.

Large‑Scale K8s Hybrid Deployment

ByteDance’s private cloud platform, TCE, uses Kubernetes for orchestration. All stateless services run as containers, so cluster size has grown rapidly and resource cost has become a concern. To improve overall resource utilization, TCE applies a dynamic oversell (overcommit) strategy.

Key steps include:

Resource Control: SysProbe collects container metrics; Spark aggregates data; TCE Platform adjusts deployment requests based on historical usage.

Resource Adjustment: VPA Controller watches long‑standing deployments and updates their resource requests.

Elastic Scaling: Pod autoscaling is combined with the two steps above to reclaim resources during traffic troughs.

Analysis of online workload peaks and troughs shows that services often request more resources than they need, which is what causes low utilization.
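The resource control and adjustment steps above can be illustrated with a small right-sizing sketch. This is not TCE's or the VPA Controller's actual logic, which the article does not describe. The function name, the 95th-percentile choice, the 20% headroom, and the 100-millicore floor are all assumptions made for illustration.

```python
def recommend_request_millicores(usage_millicores, percentile=95,
                                 headroom_pct=20, floor=100):
    """Suggest a CPU request (millicores) from historical usage samples.

    Hypothetical policy: take a high percentile of observed usage,
    add headroom, and never go below a small floor.
    """
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_millicores)
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, -(-percentile * len(ordered) // 100))
    peak = ordered[rank - 1]
    return max(floor, peak * (100 + headroom_pct) // 100)


# A service that requested 4000m but rarely uses more than 500m.
history = [500] * 19 + [2000]
suggested = recommend_request_millicores(history)
print(f"requested=4000m suggested={suggested}m")
```

Using a percentile rather than the maximum ignores rare spikes, which is where the reclaimed capacity comes from. The trade-off is that spikes must then be absorbed by autoscaling or node-level headroom.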

Hybrid Online‑Offline Scheduling

Idle online resources are offered to offline tasks (e.g., video transcoding, model training) through a hybrid scheduling framework. The framework is composed of SysProbe, a machine‑learning‑driven resource estimator, and a Hybrid controller that handles disaster recovery and watermark (resource water‑level) control.

The scheduling system operates at three levels:

Cluster Level: The K8s Scheduler and the YARN ResourceManager place containers on appropriate nodes.

Node Level: The QoS Controller adjusts node‑level resource allocation within seconds when online services experience spikes.

Kernel Level: eBPF‑based monitoring and custom CPU/IO schedulers provide low‑latency support for latency‑sensitive online services and throughput‑heavy offline jobs.
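The node-level behavior described above can be sketched as a simple watermark rule. The article does not publish the QoS Controller's algorithm, so this is only an illustration. The `NodeSample` type, the 0.75 watermark, and the three action names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    capacity_cores: float
    online_usage_cores: float
    offline_usage_cores: float


def offline_budget(sample, watermark=0.75):
    """Cores that offline tasks may use while keeping the node under the watermark."""
    ceiling = sample.capacity_cores * watermark
    return max(0.0, ceiling - sample.online_usage_cores)


def decide(sample, watermark=0.75):
    """Return (action, budget) for the offline workload on this node."""
    budget = offline_budget(sample, watermark)
    if budget == 0.0:
        return "evict", budget      # online load alone reaches the watermark
    if sample.offline_usage_cores > budget:
        return "throttle", budget   # shrink offline tasks down to the budget
    return "ok", budget


# Online usage climbs from a trough to a spike on a 64-core node.
for online in (20, 45, 50):
    print(online, decide(NodeSample(64, online, 25)))
```

Online services always get first claim on the node in this sketch, and offline tasks only receive what is left under the watermark. This matches the priority ordering the article describes, though not necessarily its mechanism.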

Federated Scheduling and Global Quota

Since 2019, a federated system enables multi‑cluster resource pooling, automatic disaster recovery, seamless cluster upgrades, and support for both private‑cloud and public‑cloud IaaS.

User Experience: SRE teams no longer manage individual clusters; upgrades are transparent.

Automatic Disaster Recovery: Full container migration on cluster or data‑center failures.

Operational Efficiency: Rapid cluster onboarding/offboarding.

Multi‑Cloud Support: Easy integration of on‑premise IDC and public‑cloud resources.
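To make the pooling and disaster-recovery ideas above concrete, here is a minimal sketch that spreads replicas across member clusters in proportion to their free capacity and re-places them when a cluster disappears. The federation's real placement policy is not described in the article. The function name, the cluster names, and the proportional rule are assumptions.

```python
def place_replicas(replicas, free_capacity):
    """Split replicas across clusters in proportion to free capacity.

    free_capacity maps cluster name -> free replica slots (integers).
    Clusters with no free capacity (for example, failed ones) are skipped.
    """
    healthy = {name: cap for name, cap in free_capacity.items() if cap > 0}
    total = sum(healthy.values())
    if replicas > total:
        raise ValueError("not enough free capacity in the federation")
    # Integer share first, then hand leftovers to the largest remainders.
    shares = {name: replicas * cap // total for name, cap in healthy.items()}
    leftover = replicas - sum(shares.values())
    by_remainder = sorted(
        healthy, key=lambda name: (-(replicas * healthy[name] % total), name)
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


clusters = {"cluster-a": 50, "cluster-b": 30, "cluster-c": 20}
print(place_replicas(10, clusters))

# Simulate losing cluster-b: its capacity drops to zero and replicas move.
clusters["cluster-b"] = 0
print(place_replicas(10, clusters))
```

Because callers only state a replica count and never name a cluster, the same call handles normal placement, failover, and cluster offboarding.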

QoS Monitoring

eBPF‑based kernel monitoring integrated into SysProbe.

Extended Cgroup metrics for CPU and memory, including throttling and load indicators.

BlockIO monitoring via VFS hooks to capture read/write behavior.

Network‑IO monitoring of socket states and SLA metrics such as SRTT jitter.

These mechanisms continuously detect and resolve issues that affect service QoS, ensuring robust isolation.
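One of the Cgroup throttling indicators mentioned above can be derived from the standard Linux `cpu.stat` file, which exposes the `nr_periods` and `nr_throttled` counters. The sketch below computes a throttle ratio between two snapshots. It is not SysProbe's implementation, which is internal. The function names and the sample values are made up for illustration.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(prev, curr):
    """Fraction of CFS periods in which the cgroup was throttled between two snapshots."""
    periods = curr["nr_periods"] - prev["nr_periods"]
    throttled = curr["nr_throttled"] - prev["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


# Two illustrative snapshots taken some seconds apart.
before = parse_cpu_stat("nr_periods 1000\nnr_throttled 50\nthrottled_time 900000\n")
after = parse_cpu_stat("nr_periods 1100\nnr_throttled 75\nthrottled_time 1400000\n")
print(f"throttle ratio: {throttle_ratio(before, after):.2f}")
```

A rising ratio on an online container suggests its CPU limit is too tight or a co-located task is interfering, which is the kind of signal a QoS controller could act on.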

Business Impact

After implementing hybrid scheduling, resource utilization rose from an average of 23% to 63%, saving roughly 40% of server costs while keeping online QPS impact minimal. The system also enabled rapid resource borrowing for events such as the 2021 Spring Festival Gala, converting offline capacity to online within five minutes.

Volcano Engine Cloud‑Native Services

Volcano Engine offers a product matrix that includes managed K8s (VKE), elastic container platform (VCI), edge containers, multi‑cloud hybrid solutions (veStack), continuous delivery, image registry, service mesh, and chaos engineering. These services package ByteDance’s years of K8s stability, observability, and service‑mesh expertise for external users.
