
Unlocking Elastic Resource Sharing: TikTok’s Cloud‑Native Mix‑Mode Scaling

This article explains how TikTok’s cloud‑native platform leverages elastic scaling, monitoring, and quota systems to dynamically share resources between online, latency‑sensitive services and offline, batch workloads, improving utilization while preserving service stability across tidal traffic patterns.

ByteDance Cloud Native

TikTok’s business consists of online services that require low latency (e.g., web, algorithm, video codec, FaaS) and offline services such as batch queries, reporting, model training, and data analysis. Online traffic shows a clear tidal pattern: peak usage during evening hours and a low‑traffic valley at night, causing CPU utilization to drop to 20‑30% of the peak.

To avoid wasting resources, the platform wraps Kubernetes' native Horizontal Pod Autoscaler (HPA) in a custom HPAGroup abstraction that scales stateless online services up and down. When online services shrink during the traffic valley, the freed capacity becomes elastic resources that can be reallocated to offline jobs.
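The scaling decision itself follows the standard HPA proportional rule: replicas grow or shrink with the ratio of observed to target utilization, clamped to the group's bounds. A minimal sketch (function and parameter names are illustrative, not ByteDance's actual API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_utilization: float,
                     target_cpu_utilization: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """HPA-style proportional scaling: scale the replica count by the
    ratio of observed to target CPU utilization, then clamp to the
    group's configured bounds."""
    raw = math.ceil(current_replicas * current_cpu_utilization
                    / target_cpu_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

With a 50% CPU target, a service running 100 replicas at 25% utilization would be scaled down to 50 replicas, freeing the other half of its footprint as elastic capacity.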

Because elastic resources are volatile, a dedicated monitoring subsystem collects real‑time metrics (QPS, P99 latency, CPU, load) via agents, aggregates them in a central store, and feeds a controller that adjusts HPAGroup replica counts. A custom quota system, built on a CRD that records group‑level resource limits, ensures that scaling actions never exceed cluster capacity and that resources are reclaimed safely.
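The quota check at the heart of that loop is simple: a scale-up is admitted only if the group stays within the limit recorded in its CRD. A sketch of that admission logic (field and function names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class GroupQuota:
    """Group-level resource limit, as a quota CRD might record it
    (illustrative fields, not the actual CRD schema)."""
    cpu_limit_cores: float
    cpu_in_use_cores: float

def admit_scale_up(quota: GroupQuota, requested_cores: float) -> bool:
    """Admit a scale-up only if the group's total usage would stay
    within its recorded limit; otherwise the controller must reject
    or defer the scaling action."""
    return quota.cpu_in_use_cores + requested_cores <= quota.cpu_limit_cores
```

Gating every scaling action through this check is what guarantees the controller never over-commits the cluster, even when metrics spike.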

The control plane also tracks a “deployment water level”. When the water level falls below a threshold (e.g., 90% of capacity), nodes are marked unschedulable for online pods, their existing pods are drained, and offline workloads are scheduled onto those nodes, effectively lending whole machines to batch jobs.
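The node-lending decision can be sketched as follows, assuming the water level is the ratio of online-requested cores to total capacity (an interpretation of the article's description; names are illustrative):

```python
def nodes_to_lend(total_nodes: int,
                  online_requested_cores: float,
                  cores_per_node: float,
                  threshold: float = 0.90) -> int:
    """Once the online deployment water level (requested / capacity)
    drops below the threshold, compute how many whole nodes can be
    drained of online pods and lent to offline workloads."""
    capacity = total_nodes * cores_per_node
    water_level = online_requested_cores / capacity
    if water_level >= threshold:
        return 0  # cluster still busy; lend nothing
    spare_cores = capacity * threshold - online_requested_cores
    return int(spare_cores // cores_per_node)
```

Lending whole machines rather than slices of them keeps the isolation story simple: offline jobs never share a node with latency-sensitive pods during the loan.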

To guarantee stability, the system prioritizes resource reclamation: high‑priority online services can pre‑empt low‑priority offline jobs, and a three‑tier priority mapping (High, Min, Low) guides which offline tasks are killed first during sudden resource shortages.
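The eviction order implied by that tiering can be sketched as a greedy victim selection: reclaim cores from the lowest offline tier first, stopping as soon as the pre-empting online service is satisfied (names are illustrative):

```python
from dataclasses import dataclass

# Lower value = evicted first, per the three-tier mapping in the text.
PRIORITY_ORDER = {"Low": 0, "Min": 1, "High": 2}

@dataclass
class OfflineTask:
    name: str
    priority: str   # "High", "Min", or "Low"
    cpu_cores: float

def select_victims(tasks: list[OfflineTask],
                   cores_needed: float) -> list[OfflineTask]:
    """Pick offline tasks to kill, lowest priority first, until enough
    cores are reclaimed for the pre-empting online service."""
    victims, reclaimed = [], 0.0
    for task in sorted(tasks, key=lambda t: PRIORITY_ORDER[t.priority]):
        if reclaimed >= cores_needed:
            break
        victims.append(task)
        reclaimed += task.cpu_cores
    return victims
```

During a sudden traffic surge, this ordering means low-priority batch queries absorb the shock first, while high-priority offline work (e.g., a long-running training job) is killed only as a last resort.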

Three concrete use cases illustrate the approach: (1) video transcoding services share elastic resources with online pods, using a GroupCRD and a scaler that adjusts pod replicas; (2) Ring AllReduce training for NLP models runs stable workers on reserved resources while elastic workers use the reclaimed capacity, achieving up to 1:8 acceleration; (3) a custom PS‑Worker framework for large‑scale recommendation training shares CPU/GPU with online services, employing NUMA‑aware isolation and dual‑NIC traffic shaping to protect latency‑sensitive traffic.

In production, the solution enables the cluster to lend roughly 3 million core‑hours per day, significantly improving overall utilization while keeping online SLA impact minimal.

Overall, time‑slot elastic mixing is suitable for early‑stage infrastructure users who need rapid scaling and higher resource efficiency without heavy upfront capacity planning.

Tags: Monitoring, Cloud Native, Kubernetes, Elastic Scaling, Resource Sharing
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
