How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok
ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.
Cloud‑Native Computing System
ByteDance, a heavy user of the Hadoop ecosystem, has built a cloud‑native big‑data stack that covers data ingestion, storage, middleware, and compute. The storage layer uses a customized CloudFS and Iceberg on top of HDFS; the middleware includes Kafka and BMQ, a self‑developed message queue written in C++; and the compute engines are Spark and Flink.
To support exabyte‑level data, ByteDance rewrote HDFS in C++ for performance and integrated object storage alongside native HDFS.
YARN has been heavily customized so that its workloads can be migrated to Kubernetes, while BMQ is a C++ reimplementation of a Kafka‑compatible message queue that separates compute from storage.
In 2016, ByteDance launched Toutiao Cloud Engine (TCE) and has since gradually containerized both offline and online workloads, which now run fully cloud‑native on Kubernetes clusters.
Compute Engines: Spark and Flink
Spark and Flink coexist at massive scale, with over 4 million cores for streaming and 5 million cores for batch, alongside legacy MapReduce and the proprietary Primus.
ByteDance is exploring Flink’s stream‑batch integration to reduce data latency, though many analytical batch tasks still rely on Spark.
Dynamic resource allocation allows Flink jobs to request fractional cores, saving resources at large scale.
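As a rough illustration of why fractional requests matter at this scale, the sketch below compares requesting exactly what a task needs against rounding every request up to a whole core. CPU is expressed in millicores (1000 millicores = 1 core), the unit Kubernetes uses; the function names and the workload numbers are hypothetical and not taken from ByteDance.

```python
def round_up_to_whole_cores(millicores: int) -> int:
    """Round a CPU request up to the next whole core, expressed in millicores."""
    return -(-millicores // 1000) * 1000


def cores_saved(num_tasks: int, needed_millicores: int) -> float:
    """Cores saved across num_tasks tasks by requesting exactly what is needed
    instead of rounding each request up to a whole core."""
    if num_tasks < 0 or needed_millicores <= 0:
        raise ValueError("num_tasks must be >= 0 and needed_millicores > 0")
    per_task_waste = round_up_to_whole_cores(needed_millicores) - needed_millicores
    return num_tasks * per_task_waste / 1000


# Hypothetical workload: 100,000 tasks that each need 0.3 of a core.
print(cores_saved(100_000, 300))
```

With these assumed numbers, each task would otherwise hold 0.7 of a core it never uses, which adds up to 70,000 cores across the whole workload.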
Fault‑tolerance improvements enable single‑task restarts instead of full job restarts, reducing recovery time.
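The difference between the two recovery strategies can be sketched with a toy model. This is not Flink's API or ByteDance's implementation; the `Job` class, its methods, and the task names are hypothetical and only count how many tasks each strategy has to restart.

```python
from dataclasses import dataclass, field

RUNNING, FAILED = "RUNNING", "FAILED"


@dataclass
class Job:
    """Toy job model: a map of task id to state plus a restart counter."""
    states: dict = field(default_factory=dict)
    restart_count: int = 0

    @classmethod
    def with_tasks(cls, n: int) -> "Job":
        return cls(states={f"task-{i}": RUNNING for i in range(n)})

    def fail(self, task_id: str) -> None:
        if task_id not in self.states:
            raise KeyError(task_id)
        self.states[task_id] = FAILED

    def _restart(self, task_id: str) -> None:
        self.states[task_id] = RUNNING
        self.restart_count += 1

    def recover_full(self) -> None:
        """Full job restart: every task is restarted, healthy or not."""
        for task_id in list(self.states):
            self._restart(task_id)

    def recover_single(self) -> None:
        """Single-task restart: only tasks in the FAILED state are restarted."""
        for task_id, state in list(self.states.items()):
            if state == FAILED:
                self._restart(task_id)


full, single = Job.with_tasks(1000), Job.with_tasks(1000)
full.fail("task-7")
single.fail("task-7")
full.recover_full()
single.recover_single()
print(full.restart_count, single.restart_count)
```

In this model a single failure in a 1,000-task job costs 1,000 restarts under full recovery and one restart under single-task recovery, which is the source of the shorter recovery time.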
Resource Scheduling
ByteDance’s average resource utilization exceeds 40%, with some mixed‑cluster pools reaching over 60% thanks to unified scheduling across online (Kubernetes) and offline (YARN) workloads.
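To make the utilization figures concrete, here is a minimal sketch of how pool utilization might be computed and why co-locating offline work on idle online capacity raises it. The core counts are invented for illustration and are not ByteDance's numbers.

```python
def utilization(used_cores: float, total_cores: float) -> float:
    """Fraction of a pool's cores that are actually in use."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= used_cores <= total_cores:
        raise ValueError("used_cores must be between 0 and total_cores")
    return used_cores / total_cores


# Hypothetical pool of 100,000 cores.
POOL_CORES = 100_000
ONLINE_USED = 25_000   # assumed online usage outside peak hours
OFFLINE_USED = 40_000  # assumed batch work scheduled onto the idle cores

online_only = utilization(ONLINE_USED, POOL_CORES)
mixed = utilization(ONLINE_USED + OFFLINE_USED, POOL_CORES)
print(f"online only: {online_only:.0%}, mixed: {mixed:.0%}")
```

Under these assumptions the same hardware goes from 25% to 65% utilization once offline jobs fill the capacity that online services leave idle.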
The legacy YARN system was rewritten to delegate resource management to Kubernetes while preserving the YARN API, creating a hybrid scheduler.
Three new components were added:
Yodel: simulates the YARN ResourceManager and supports the YARN APIs.
Unified Scheduler: a high‑performance scheduler that replaces the Kubernetes default scheduler, offering multi‑tenant isolation and advanced scheduling policies.
BigData Plugin: assists the Kubelet with localization, shuffle, and other big‑data job needs.
This unified pool enables transparent resource accounting, cross‑team coordination, and cost reduction.
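One way to picture the YARN-compatibility layer described above is as a translation step: a YARN-style container request (memory in MB and virtual cores) comes in, and a Kubernetes Pod manifest addressed to a custom scheduler goes out. The sketch below is an assumption about the general shape of such a shim, not Yodel's actual code; `YarnContainerRequest`, `to_pod_manifest`, the scheduler name, and the image name are all hypothetical. The manifest fields themselves (`schedulerName`, `resources.requests`) are standard Kubernetes Pod fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YarnContainerRequest:
    """Hypothetical YARN-style resource ask: memory in MB plus virtual cores."""
    app_id: str
    memory_mb: int
    vcores: int


def to_pod_manifest(req: YarnContainerRequest, image: str,
                    scheduler_name: str = "unified-scheduler") -> dict:
    """Translate a YARN-style request into a Kubernetes Pod manifest (as a dict)."""
    if req.memory_mb <= 0 or req.vcores <= 0:
        raise ValueError("memory_mb and vcores must be positive")
    # Kubernetes object names may not contain underscores or uppercase letters.
    pod_name = req.app_id.lower().replace("_", "-") + "-container"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name, "labels": {"yarn-app-id": req.app_id}},
        "spec": {
            # Route the Pod to a non-default scheduler (name is an assumption).
            "schedulerName": scheduler_name,
            "containers": [{
                "name": "task",
                "image": image,
                "resources": {
                    "requests": {
                        "cpu": str(req.vcores),
                        "memory": f"{req.memory_mb}Mi",
                    },
                },
            }],
        },
    }


request = YarnContainerRequest(app_id="application_1_0001", memory_mb=4096, vcores=2)
manifest = to_pod_manifest(request, image="example/spark-executor:latest")
print(manifest["spec"]["containers"][0]["resources"]["requests"])
```

Keeping the request type unchanged on the YARN side is what would let existing Spark and Flink jobs keep calling the YARN API while the actual placement decision moves to Kubernetes.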
Post‑Hadoop Cloud‑Native Platform
The shift toward cloud‑native big‑data platforms is inevitable: Spark’s Kubernetes support became generally available in 2021 (Spark 3.1), and Kafka’s commercial backer has also moved its offering to Kubernetes.
ByteDance’s Volcano Engine offers a one‑stop cloud‑native solution—BMQ/Kafka ingestion, real‑time Flink and batch Spark processing, CloudFS storage, and OpenStudio/OpenOps for multi‑tenant management—positioning it as a modern alternative to CDH.
By unifying online and offline resources, the platform can quickly reallocate millions of cores during peak events such as Chinese New Year, ensuring stable service delivery.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.