How Cloud‑Native Architecture Transforms Big Data Operations at ByteDance
This article explains how ByteDance migrated its complex, component‑heavy big‑data platform to a cloud‑native architecture, detailing the challenges of traditional deployments, the benefits of micro‑service, container, immutable‑infrastructure and declarative‑API approaches, and the resulting low‑resource, highly‑scalable, portable operations framework.
01 Business Situation and Background
Cloud‑native big data is the next‑generation architecture for big‑data platforms. As ByteDance’s internal services grew rapidly, the traditional big‑data operations platform showed several drawbacks: a large number of components, complex installation, tight coupling with the underlying environment, and no out‑of‑the‑box logging, monitoring, or alerting for business users.
Component proliferation : Big‑data tasks require many components (e.g., distributed storage, Flink, Spark, ETL tools, schedulers, logging and monitoring systems, permission services).
Deployment complexity : The components depend on one another through a mix of strong and weak dependencies, and sometimes circular ones, which makes deployment difficult.
Environment coupling : Components often need to know configuration details of other components, creating deep coupling that hinders portability.
To address these issues, ByteDance began a cloud‑native transformation of its tools.
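The dependency problem described above can be made concrete with a small sketch. It is not ByteDance's tooling: the function name `deployment_order` and the component names are hypothetical, and the code only shows how a dependency map can be turned into a deployment order, with circular dependencies reported instead of silently blocking the rollout. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def deployment_order(dependencies):
    """Return component names ordered so each one follows its dependencies.

    `dependencies` maps a component name to the set of components it needs.
    Raises ValueError that names the cycle if the map is circular.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        # static_order() is lazy; list() forces the cycle check to run here.
        return list(sorter.static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        cycle = exc.args[1]
        raise ValueError("circular dependency: " + " -> ".join(cycle)) from exc


if __name__ == "__main__":
    # Hypothetical component graph, for illustration only.
    components = {
        "spark": {"storage"},
        "storage": {"permission"},
        "permission": set(),
    }
    print(deployment_order(components))
```

A map such as `{"a": {"b"}, "b": {"a"}}` raises `ValueError`, which is the situation that makes traditional deployments hard to automate.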
Cloud‑Native Scenario Features
No awareness of service state : Users consume functionality without needing to know the underlying runtime state.
Extreme elasticity : Hiding runtime state enables on‑demand scaling, significantly reducing costs.
Rapid failover : Elastic scaling allows quick removal of faulty nodes and addition of healthy ones, providing seamless, loss‑less failover for users.
These three characteristics reinforce each other, forming a virtuous cycle.
Cloud‑Native Evolution Directions
Component micro‑service : Decompose the system into small, cohesive components to achieve high cohesion and low coupling.
Application containerization : Containers ensure portability and consistency across environments.
Immutable infrastructure : Encapsulate everything to isolate the underlying infrastructure, improving consistency, reliability, and simplicity.
Declarative API : Users declare the desired state; the backend fulfills it, reducing user awareness of internal processes and simplifying evolution.
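The declarative idea can be sketched as a reconciliation step that compares the declared state with the observed state and derives the actions needed to close the gap. This is a minimal in-memory illustration, not ByteDance's API: `ComponentSpec`, `plan_actions`, and `reconcile` are hypothetical names, and a real backend would call the cluster instead of mutating a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSpec:
    """Hypothetical desired state of one component."""
    image: str
    replicas: int


def plan_actions(desired, actual):
    """Return (verb, name) pairs that would make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired, actual):
    """Apply the planned actions to the in-memory `actual` state."""
    for verb, name in plan_actions(desired, actual):
        if verb == "delete":
            del actual[name]
        else:
            # "create" and "update" both converge on the declared spec.
            actual[name] = desired[name]
    return actual


if __name__ == "__main__":
    desired = {"flink": ComponentSpec("flink:example", 3)}
    actual = {"flink": ComponentSpec("flink:example", 1),
              "legacy-etl": ComponentSpec("etl:example", 1)}
    print(plan_actions(desired, actual))
    reconcile(desired, actual)
    print(plan_actions(desired, actual))
```

The user only edits `desired`; how the backend creates, updates, or deletes components stays hidden, which is what lets the internals evolve without changing the user-facing contract.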
02 Architecture Evolution
Cloud‑Native Big Data Overview
Cloud‑native big data runs on containers, provided either by public‑cloud container services or by a private‑cloud Kubernetes (K8s) base. The platform consists of three layers and a supporting system:
Scheduling layer : Manages compute, storage, and network resources across the cluster.
Engine layer : Includes ByteDance’s unified storage system (HDFS‑compatible, with S3 support), compute engines such as Flink and Spark, message middleware, and real‑time analytics tools.
Platform layer : Packages engine capabilities into an external product.
The operations management platform supports these three layers, providing daily component management and adapting to the cloud‑native transformation.
Cloud‑Native Operations Practices
Low resource footprint : The operations module should be barely noticeable and consume minimal resources.
Strong scalability : Logging and monitoring scale with cluster size, requiring rapid horizontal scaling.
High stability : Must recover quickly from failures and provide disaster‑recovery for other components.
Strong portability : Should be environment‑agnostic, supporting plug‑and‑play deployment across clouds.
Minimal environment awareness : Shields business users from environment differences, offering a uniform experience.
To meet these goals, ByteDance abstracts a unified environment model, builds a flexible component‑management service for metadata, and abstracts common functions (logging, monitoring, alerts) to hide environmental differences.
03 Environment Management and Component Services
Environment Management
The environment is divided into three logical zones (not physical isolation): control plane, system plane, and data plane.
Control plane : Provides lightweight business support, handling environment governance, cost accounting, and service gateway functions.
System plane : Exists per logical unit; multiple units can be coordinated by the control plane (e.g., multi‑active regions).
Data plane : Supplies compute, storage, and network resources for engine execution, forming logical federated clusters under system‑plane coordination.
Component Services
Components are divided into four layers: system‑level, cluster‑level, tenant‑level, and project‑level. The system level handles most governance logic; the cluster level manages agents, schedulers, and operators; the tenant level hosts components dedicated to large users; the project level hosts job instances, middleware, and third‑party tools. This grid‑style division isolates components from environment details.
Helm Customization : Native K8s resources (Deployment, ConfigMap, Service) lack cohesive tooling for cloud‑native component APIs. ByteDance heavily customized Helm and wrapped it in a service‑oriented API that exposes deployment, upgrade, rollback, and visualization functions and enables fine‑grained configuration merging, dynamic resource modification, and rapid validation.
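Fine‑grained configuration merging can be illustrated with a recursive merge of layered values. The article does not describe ByteDance's actual merge rules, so the sketch below is an assumption: nested mappings are merged key by key, and any other value (including lists) in the override replaces the default. The function name `merge_values` and the sample keys are hypothetical.

```python
import copy


def merge_values(base, override):
    """Return a new dict in which `override` is layered on top of `base`.

    Nested dicts are merged recursively; every other value in `override`
    replaces the one in `base`. Neither input is modified.
    """
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


if __name__ == "__main__":
    # Hypothetical chart defaults and a per-environment override.
    defaults = {"replicas": 1, "resources": {"cpu": "1", "memory": "2Gi"}}
    override = {"resources": {"memory": "4Gi"}}
    print(merge_values(defaults, override))
```

Because the merge returns a new dictionary, the same defaults can be combined with different overrides per cluster, tenant, or project without one layer leaking into another.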
Disk Management
K8s handles stateless workloads well but struggles with stateful workloads because of how it handles local disks. The pain points identified include environment coupling, low utilization, poor isolation, and high maintenance difficulty.
ByteDance built a unified CSI (Container Storage Interface) implementation to collect and manage all disk information, categorizing storage into three types:
Shared capacity volume : Low‑IO, flexible capacity for temporary data such as logs or intermediate results.
Shared disk volume : Moderate IO, requires isolation and persistence, suitable for caches.
Exclusive disk volume : High IO isolation for workloads like Kafka or HDFS.
Disk management is split into two zones: K8s‑maintained zones (e.g., emptyDir for configuration or small temporary data) and CSI‑managed zones, which are further divided into the three volume types. The CSI abstracts storage classes, allowing both public‑cloud disks and centralized storage to be provisioned uniformly, decoupling disks from components.
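The three volume types can be read as a simple decision rule over a workload's IO and persistence needs. The sketch below is one possible reading of the descriptions above, not the logic of ByteDance's CSI: `VolumeType`, `StorageRequest`, `choose_volume_type`, and the `io_level` values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class VolumeType(Enum):
    SHARED_CAPACITY = "shared-capacity"  # low IO, flexible capacity
    SHARED_DISK = "shared-disk"          # moderate IO, isolated and persistent
    EXCLUSIVE_DISK = "exclusive-disk"    # high IO isolation


@dataclass(frozen=True)
class StorageRequest:
    """Hypothetical description of what a workload needs from its disk."""
    io_level: str      # "low", "moderate", or "high"
    persistent: bool   # True if data must survive restarts


def choose_volume_type(request):
    """Map a storage request to one of the three volume types."""
    if request.io_level not in ("low", "moderate", "high"):
        raise ValueError(f"unknown io_level: {request.io_level!r}")
    if request.io_level == "high":
        return VolumeType.EXCLUSIVE_DISK
    if request.persistent or request.io_level == "moderate":
        return VolumeType.SHARED_DISK
    return VolumeType.SHARED_CAPACITY


if __name__ == "__main__":
    print(choose_volume_type(StorageRequest("low", False)))       # e.g. logs
    print(choose_volume_type(StorageRequest("moderate", True)))   # e.g. a cache
    print(choose_volume_type(StorageRequest("high", True)))       # e.g. Kafka
```

Keeping the choice in one place means a component states what it needs rather than which disk it wants, which matches the goal of decoupling disks from components.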