How ByteDance Boosts Cluster Utilization with Elastic Scaling and Mixed Deployment

This article explains how ByteDance's private cloud platform TCE leverages Kubernetes deployments, oversubscription, elastic scaling, and mixed online‑offline resource sharing to dramatically improve cluster resource utilization while maintaining service stability during traffic peaks and valleys.

Volcano Engine Developer Services

May 7, 2021

Background

ByteDance runs almost all stateless services as containers on its private cloud platform TCE, which uses Kubernetes for orchestration. Services include typical micro‑services and algorithm‑heavy workloads such as recommendation and advertising.

These services are deployed as Kubernetes Deployments with multiple replicas, exposing RPC or HTTP interfaces behind Consul or load balancers. Because they are stateless, their instances can be rescheduled across nodes, and because their resource usage tracks traffic, the replica count can be adjusted dynamically.

TCE operates at massive scale: over 40 Kubernetes clusters across China, Singapore, and US East, managing hundreds of thousands of servers, more than 40,000 services, 300,000+ Deployments and 3 million Pods. The sheer size drives the need to improve overall resource utilization.

Resource Utilization Analysis

Online services follow a stable daily traffic pattern, but their owners often request more resources than needed in order to guarantee stability. The gap between requested and actual usage is wasted capacity. In addition, traffic is tidal, with high peaks in the evening and deep valleys late at night, so even resources sized correctly for the peak sit idle for much of the day.
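The gap described above can be made concrete with a little arithmetic. The following is a minimal sketch; the function name and the numbers (8 requested cores, 6 used at peak, 2 used in the valley) are hypothetical and are not taken from TCE.

```python
def wasted_fraction(requested_cores: float, used_cores: float) -> float:
    """Fraction of the requested CPU that is not actually used."""
    if requested_cores <= 0:
        raise ValueError("requested_cores must be positive")
    # Usage above the request is treated as zero waste, not negative waste.
    return max(0.0, requested_cores - used_cores) / requested_cores


# Hypothetical replica that requests 8 cores all day long.
usage_by_period = {"evening peak": 6.0, "night valley": 2.0}
for period, used in usage_by_period.items():
    print(f"{period}: {wasted_fraction(8.0, used):.0%} of the request is idle")
```

In this made-up case a quarter of the request is idle even at peak, which is the kind of slack oversubscription targets, and three quarters is idle in the valley, which is what elastic scaling and mixed deployment target.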

To address this, ByteDance applies three complementary techniques:

Oversubscription to reclaim redundant resources.

Elastic scaling to shrink resources during low‑traffic periods.

Mixed deployment to lend idle online resources to offline jobs.

Elastic Scaling

Service owners set utilization thresholds (CPU, memory, QPS, etc.). The scaling controller polls the average utilization across all replicas of a service: if the average is below the threshold, it reduces the replica count; if it is above, it adds replicas. Because stability is critical, the system must be able to scale back up quickly after a scale‑down.
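A threshold-driven decision of this kind might look like the sketch below. It borrows the proportional formula used by the upstream Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × average / target)); whether TCE uses the same formula is an assumption, and the function and parameter names are hypothetical.

```python
import math
from statistics import mean


def desired_replicas(current: int, utilizations: list[float],
                     target: float, min_replicas: int = 1) -> int:
    """Return the replica count that would bring average utilization to target.

    utilizations holds one sample per replica, in the same unit as target
    (for example, percent of requested CPU).
    """
    avg = mean(utilizations)
    # Below target the ratio is < 1 (scale down); above target it is > 1 (scale up).
    return max(min_replicas, math.ceil(current * avg / target))


# Four replicas at 30% against a 60% target: half as many replicas suffice.
print(desired_replicas(4, [30, 30, 30, 30], target=60))
# Two replicas at 90% against a 60% target: one more replica is needed.
print(desired_replicas(2, [90, 90], target=60))
```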

Key supporting mechanisms include:

Cluster‑level scalability and high availability to handle frequent scaling actions.

A robust monitoring system that provides real‑time utilization data for scaling decisions.

A quota system that guarantees resource limits remain controllable during scaling.

The monitoring stack replaces the native Metrics Server with a custom solution comprising SysProbe agents, a Metrics Agent, a Proxy, and an in‑memory Store, achieving ~30 ms query latency and ~60 s collection latency.

Scaling decisions are executed by a custom CRD‑based controller (HPAExtension) that synchronizes every 30 seconds. It supports multiple resource dimensions (CPU, memory, GPU), time‑based configurations, and tolerance thresholds, and it scales stepwise to avoid putting sudden pressure on storage components.
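To illustrate how a tolerance threshold and stepwise scaling interact, here is a minimal sketch of what one synchronization round could compute. It is not the HPAExtension implementation: the 10% tolerance, the 20% per-round step cap, and all names are assumptions chosen for illustration.

```python
import math


def next_replicas(current: int, avg_util: float, target: float,
                  tolerance: float = 0.1, max_step_pct: int = 20,
                  min_replicas: int = 1, max_replicas: int = 1000) -> int:
    """Compute the replica count for one sync round (hypothetical policy)."""
    ratio = avg_util / target
    # Inside the tolerance band, do nothing; this avoids flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    # Cap the change per round so that pods are not created or deleted en masse.
    step = max(1, current * max_step_pct // 100)
    bounded = min(max(desired, current - step), current + step)
    return min(max(bounded, min_replicas), max_replicas)


print(next_replicas(100, avg_util=63, target=60))   # within tolerance
print(next_replicas(100, avg_util=20, target=60))   # scale-down, capped
print(next_replicas(100, avg_util=120, target=60))  # scale-up, capped
```

Under these assumed settings a service that is far from its target converges over several rounds rather than in one jump, which is the point of stepping: each round touches at most a fifth of the replicas.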

Mixed Deployment (Resource Sharing)

After elastic scaling frees resources during low‑traffic periods, these resources are offered to offline workloads (e.g., video transcoding, model training) that have no strict timing constraints. The system introduces a cluster deployment water‑level concept: when the water‑level drops below a threshold, selected online nodes are marked unschedulable and transferred to the offline cluster.
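The water‑level check can be sketched as follows. The article does not define the water level precisely, so this example assumes it is the ratio of CPU allocated to online pods over total cluster CPU capacity. The node selection policy (emptiest nodes first, stopping before the remaining online cluster becomes too full) and all names and thresholds are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int   # CPU cores on the node
    allocated: int  # CPU cores requested by online pods on the node


def water_level(nodes: list[Node]) -> float:
    """Assumed definition: allocated cores divided by total capacity."""
    total = sum(n.capacity for n in nodes)
    return sum(n.allocated for n in nodes) / total if total else 0.0


def pick_nodes_to_lend(nodes: list[Node], low_threshold: float,
                       target_level: float) -> list[str]:
    """Choose nodes to mark unschedulable and hand to the offline cluster."""
    if water_level(nodes) >= low_threshold:
        return []  # the online cluster is busy enough; lend nothing
    total_alloc = sum(n.allocated for n in nodes)
    total_cap = sum(n.capacity for n in nodes)
    lent = []
    # Emptiest nodes first, so that the fewest online pods must move.
    for node in sorted(nodes, key=lambda n: n.allocated):
        remaining_cap = total_cap - node.capacity
        # Pods on a lent node are assumed to be rescheduled onto the rest,
        # so allocation stays constant while capacity shrinks.
        if remaining_cap <= 0 or total_alloc / remaining_cap > target_level:
            break
        total_cap = remaining_cap
        lent.append(node.name)
    return lent


cluster = [Node("n1", 100, 10), Node("n2", 100, 20),
           Node("n3", 100, 30), Node("n4", 100, 40)]
print(water_level(cluster))
print(pick_nodes_to_lend(cluster, low_threshold=0.5, target_level=0.6))
```

Leaving a gap between the lend threshold and the post-lend target is a deliberate choice in this sketch: it keeps headroom in the online cluster so that a modest traffic increase does not immediately force nodes to be reclaimed.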

During traffic peaks, the process reverses: offline tasks are drained, nodes are marked schedulable again, and resources are reclaimed for online services. A state machine governs node transitions (Online, Offline, OnlineToOffline, OfflineToOnline) to ensure smooth handover.
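The four node states named above can be modeled as a small state machine. The sketch below assumes the simplest possible transition graph, a single cycle through the four states; the real system may permit other transitions (for example, aborting a handover), and the function name is hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "Online"
    ONLINE_TO_OFFLINE = "OnlineToOffline"
    OFFLINE = "Offline"
    OFFLINE_TO_ONLINE = "OfflineToOnline"


# Assumed transition graph: a node always passes through a transitional state,
# during which it is drained, before it changes owner.
ALLOWED = {
    NodeState.ONLINE: {NodeState.ONLINE_TO_OFFLINE},
    NodeState.ONLINE_TO_OFFLINE: {NodeState.OFFLINE},
    NodeState.OFFLINE: {NodeState.OFFLINE_TO_ONLINE},
    NodeState.OFFLINE_TO_ONLINE: {NodeState.ONLINE},
}


def transition(current: NodeState, target: NodeState) -> NodeState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = NodeState.ONLINE
state = transition(state, NodeState.ONLINE_TO_OFFLINE)  # cordon and drain online pods
state = transition(state, NodeState.OFFLINE)            # node now serves offline jobs
print(state.value)
```

Forcing every handover through a transitional state means that a node is never visible to both schedulers at once, which is one plausible way to achieve the smooth handover the article describes.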

Two mixed‑deployment modes are used:

Water‑level based sharing, where idle nodes are lent to offline jobs.

Timed‑quantity sharing, where offline clusters receive whole machines during online peaks to compensate for resource shortages.

Future work aims to decouple resource providers and consumers via a resource market and a hierarchical quota system, enabling unified elastic resource management, cost accounting, and priority‑based allocation.
