
How Cloud‑Native Architecture Transforms Big Data Operations at ByteDance

This article explains how ByteDance migrated its complex, component‑heavy big‑data platform to a cloud‑native architecture, detailing the challenges of traditional deployments, the benefits of micro‑service, container, immutable‑infrastructure and declarative‑API approaches, and the resulting low‑resource, highly‑scalable, portable operations framework.

Efficient Ops

01 Business Situation and Background

Cloud‑native big data is the next‑generation architecture for big‑data platforms. As ByteDance’s internal services grew rapidly, the traditional big‑data operations platform showed several drawbacks: too many components, complex installation, tight coupling with the underlying environment, and no out‑of‑the‑box logging, monitoring, or alerting for business users.

Component proliferation : Big‑data tasks require many components (e.g., distributed storage, Flink, Spark, ETL tools, schedulers, logging and monitoring systems, permission services).

Deployment complexity : The components depend on one another through a mix of strong and weak dependencies, and sometimes circular ones, which makes deployment difficult.

Environment coupling : Components often need to know configuration details of other components, creating deep coupling that hinders portability.

To address these issues, ByteDance began a cloud‑native transformation of its tools.
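The dependency problem described above can be made concrete with a small sketch. It assumes a hypothetical dependency map (the component names and edges are illustrative, not ByteDance's real topology) and uses Python's standard `graphlib` module to derive an install order and to report circular dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: component -> components that must be deployed first.
DEPENDENCIES = {
    "storage": set(),
    "permission-service": set(),
    "scheduler": {"storage", "permission-service"},
    "flink": {"storage", "scheduler"},
    "monitoring": {"storage"},
}


def deployment_order(deps):
    """Return an install order, or raise ValueError on a circular dependency."""
    try:
        # static_order() yields every component after its prerequisites.
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes forming the cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"circular dependency: {cycle}") from exc
```

Calling `deployment_order(DEPENDENCIES)` returns a list in which `storage` precedes `scheduler`, which in turn precedes `flink`. A map such as `{"a": {"b"}, "b": {"a"}}` raises `ValueError` instead of producing an order.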

Cloud‑Native Scenario Features

No service state perception : Users can consume functionality without caring about the underlying runtime state.

Extreme elasticity : Hiding runtime state enables on‑demand scaling, significantly reducing costs.

Rapid failover : Elastic scaling allows quick removal of faulty nodes and addition of healthy ones, providing seamless, loss‑less failover for users.

These three characteristics reinforce each other, forming a virtuous cycle.
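As a rough illustration of how elasticity supports failover, the sketch below drops unhealthy nodes and tops the pool back up to a desired size. The `Node` type, the `fail_over` function, and the node names are all hypothetical; a real platform would delegate this to its scheduler.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Node:
    name: str
    healthy: bool


def fail_over(nodes, desired, new_names):
    """Drop unhealthy nodes, then add fresh ones until `desired` is reached."""
    survivors = [n for n in nodes if n.healthy]
    missing = max(0, desired - len(survivors))
    # Each replacement takes the next name from the supplied iterator.
    replacements = [Node(next(new_names), True) for _ in range(missing)]
    return survivors + replacements


pool = [Node("n1", True), Node("n2", False), Node("n3", True)]
fresh_names = (f"n{i}" for i in count(4))
recovered = fail_over(pool, 3, fresh_names)
```

Here `recovered` contains `n1`, `n3`, and a new healthy node `n4`. Because users never see individual nodes, the swap is invisible to them. The sketch only replaces nodes; it never scales the pool down.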

Cloud‑Native Evolution Directions

Component micro‑service : Decompose the system into small, single‑purpose services so that each is highly cohesive and only loosely coupled to the others.

Application containerization : Containers ensure portability and consistency across environments.

Immutable infrastructure : Package each component together with what it needs to run, so that it is isolated from the underlying infrastructure; this improves consistency, reliability, and simplicity.

Declarative API : Users declare the desired state; the backend fulfills it, reducing user awareness of internal processes and simplifying evolution.
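A minimal sketch of the declarative idea: the user supplies only the desired state, and the backend works out the actions needed to reach it. The `plan` function and the component names are hypothetical and are not taken from ByteDance's API.

```python
def plan(desired, actual):
    """Compare declared replica counts with observed ones and list actions."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have == 0 and want > 0:
            actions.append(("create", name, want))
        elif want == 0 and have > 0:
            actions.append(("delete", name, have))
        elif want != have:
            actions.append(("scale", name, want))
    return actions


desired_state = {"flink-jobmanager": 1, "spark-history": 2}
actual_state = {"spark-history": 1, "old-etl": 1}
actions = plan(desired_state, actual_state)
```

With these inputs `actions` is `[("create", "flink-jobmanager", 1), ("delete", "old-etl", 1), ("scale", "spark-history", 2)]`. The user never issues those steps directly, which is what lets the backend change how it fulfils a declaration without affecting callers.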

02 Architecture Evolution

Cloud‑Native Big Data Overview

Cloud‑native big data runs on a container platform, which can be either a public‑cloud container service or a private‑cloud Kubernetes (K8s) base. The platform consists of three layers and a supporting system:

Scheduling layer : Manages compute, storage, and network resources across the cluster.

Engine layer : Comprises ByteDance’s unified storage system (HDFS‑compatible, with S3 support), compute engines such as Flink and Spark, message middleware, and real‑time analytics tools.

Platform layer : Packages engine capabilities into an external product.

The operations management platform supports these three layers, providing daily component management and adapting to the cloud‑native transformation.

Cloud‑Native Operations Practices

Low resource footprint : The operations module should be barely noticeable and consume minimal resources.

Strong scalability : Logging and monitoring scale with cluster size, requiring rapid horizontal scaling.

High stability : Must recover quickly from failures and provide disaster‑recovery for other components.

Strong portability : Should be environment‑agnostic, supporting plug‑and‑play deployment across clouds.

Weak environment perception : Shield business users from environment differences, offering a uniform experience.

To meet these goals, ByteDance defines a unified environment model, builds a flexible component‑management service that manages component metadata, and abstracts common functions (logging, monitoring, alerting) so that environmental differences are hidden from users.
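One way to picture how a common function hides environment differences is an interface with one implementation per environment. The class names, environment keys, and backend labels below are assumptions for illustration; the article does not describe the actual interfaces.

```python
from abc import ABC, abstractmethod


class LogSink(ABC):
    """Uniform logging interface seen by business code (hypothetical)."""

    @abstractmethod
    def write(self, component: str, message: str) -> str:
        ...


class PublicCloudSink(LogSink):
    def write(self, component, message):
        # Stand-in for a call to a managed cloud logging service.
        return f"cloud-log-service <- [{component}] {message}"


class PrivateK8sSink(LogSink):
    def write(self, component, message):
        # Stand-in for a call to an in-cluster log collector.
        return f"in-cluster-collector <- [{component}] {message}"


SINKS = {"public-cloud": PublicCloudSink, "private-k8s": PrivateK8sSink}


def sink_for(environment: str) -> LogSink:
    """Pick the implementation for an environment; callers see only LogSink."""
    try:
        return SINKS[environment]()
    except KeyError:
        raise ValueError(f"unknown environment: {environment}") from None
```

Business code calls `sink_for(env).write("flink", "job started")` and gets the same behaviour on either platform; only the mapping in `SINKS` knows which backend is in use.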

03 Environment Management and Component Services

Environment Management

The environment is divided into three logical zones (not physical isolation): control plane, system plane, and data plane.

Control plane : Provides lightweight business‑facing support, handling environment governance, cost accounting, and service gateway functions.

System plane : Exists per logical unit; multiple units can be coordinated by the control plane (e.g., multi‑active regions).

Data plane : Supplies compute, storage, and network resources for engine execution, forming logical federated clusters under system‑plane coordination.
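The relationship between the three planes can be sketched as a small data model. All class and field names here are hypothetical, and counting CPU cores merely stands in for the cost accounting mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class DataPlane:
    cluster: str
    cpu_cores: int  # resources offered to the engines


@dataclass
class SystemPlane:
    unit: str  # one logical unit, for example a region
    data_planes: list = field(default_factory=list)

    def total_cores(self):
        # The system plane federates the clusters that belong to its unit.
        return sum(dp.cpu_cores for dp in self.data_planes)


@dataclass
class ControlPlane:
    system_planes: list = field(default_factory=list)

    def cost_report(self):
        """Aggregate resources per logical unit across all system planes."""
        return {sp.unit: sp.total_cores() for sp in self.system_planes}


control = ControlPlane([
    SystemPlane("region-a", [DataPlane("a1", 64), DataPlane("a2", 32)]),
    SystemPlane("region-b", [DataPlane("b1", 128)]),
])
report = control.cost_report()
```

Here `report` is `{"region-a": 96, "region-b": 128}`. The zones are purely logical: nothing in the model says which physical machines host each plane.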

Component Services

Components are layered into four levels:

System‑level : Handles most governance logic.

Cluster‑level : Manages agents, schedulers, and operators.

Tenant‑level : Supports dedicated components for large users.

Project‑level : Hosts job instances, middleware, and third‑party tools.

This grid‑style division isolates components from environment details.

Helm Customization : Native K8s resources (Deployment, ConfigMap, Service) are not tied together by any tooling that presents a cloud‑native component as a single unit with its own API. ByteDance heavily customized Helm and exposed it as a service‑oriented API offering deployment, upgrade, rollback, and visualization functions, which enables fine‑grained configuration merging, dynamic resource modification, and rapid validation.
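Fine‑grained configuration merging can be pictured as a recursive merge of layered values, similar in spirit to how Helm combines several values files with later ones taking precedence. The article does not describe ByteDance's merge rules, so the layer names and keys below are assumptions.

```python
def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`; override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def merge_layers(*layers):
    """Merge layers left to right; later layers take precedence."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers, from most general to most specific.
system_values = {"resources": {"cpu": "1", "memory": "2Gi"}, "replicas": 1}
cluster_values = {"resources": {"memory": "4Gi"}}
project_values = {"replicas": 3, "logLevel": "debug"}
final_values = merge_layers(system_values, cluster_values, project_values)
```

The result keeps `cpu: "1"` from the system layer, takes `memory: "4Gi"` from the cluster layer, and takes `replicas: 3` and `logLevel: "debug"` from the project layer. None of the input dictionaries is modified.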

Disk Management

K8s handles stateless workloads well but struggles with stateful ones because of how local disks are handled. The pain points identified include coupling to the environment, low disk utilization, poor isolation, and difficult maintenance.

ByteDance built a unified CSI (Container Storage Interface) implementation to collect and manage all disk information, categorizing storage into three types:

Shared capacity volume : Low‑IO, flexible capacity for temporary data such as logs or intermediate results.

Shared disk volume : Moderate IO, requires isolation and persistence, suitable for caches.

Exclusive disk volume : High IO with strong isolation, for workloads like Kafka or HDFS.

Disk management is split into two zones: a K8s‑maintained zone (e.g., emptyDir volumes for configuration or small temporary data) and a CSI‑managed zone, which is further divided into the three volume types above. The CSI layer exposes storage through storage‑class abstractions, so public‑cloud disks and centralized storage can be provisioned in the same way, decoupling disks from components.
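The three volume types can be read as a simple decision rule. The sketch below is one possible mapping from a workload profile to a volume type; the `io_level` labels and the `choose_volume` function are hypothetical, and real provisioning would go through K8s storage classes rather than a Python function.

```python
from enum import Enum


class VolumeType(Enum):
    SHARED_CAPACITY = "shared-capacity"  # low IO, temporary data
    SHARED_DISK = "shared-disk"          # moderate IO, isolated and persistent
    EXCLUSIVE_DISK = "exclusive-disk"    # high IO, strong isolation


def choose_volume(io_level: str, needs_persistence: bool) -> VolumeType:
    """Map a workload profile to one of the three volume types."""
    if io_level not in {"low", "moderate", "high"}:
        raise ValueError(f"unknown io_level: {io_level}")
    if io_level == "high":
        return VolumeType.EXCLUSIVE_DISK
    if io_level == "moderate" or needs_persistence:
        return VolumeType.SHARED_DISK
    return VolumeType.SHARED_CAPACITY
```

Under these assumptions a Kafka broker (`"high"`) gets an exclusive disk, a cache (`"moderate"`, persistent) gets a shared disk, and log or intermediate data (`"low"`, not persistent) lands on a shared capacity volume.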

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
