ByteDance Stateful Application Cloud‑Native Practices
ByteDance’s cloud‑native migration of stateful services is built on a custom SolarService workload that extends StatefulSet with a Budset CRD. It covers versioned data, shard‑aware routing, NUMA‑aware scheduling, enhanced storage, eBPF‑based monitoring, and automated eviction through an extended PDB, with the aim of improving efficiency, cutting cost, and making rolling upgrades reliable.
Background
The talk introduces the challenges and solutions ByteDance faced when migrating stateful applications to a cloud‑native environment. It contrasts stateless services, which fit naturally with Kubernetes objects like Deployments, with stateful services that require data persistence, sharding, and unique instance identifiers.
Characteristics of Stateful Applications
Stateful apps depend on local data, must preserve data across upgrades, and often have master‑slave or primary‑replica relationships. They can be data‑stateful or network‑stateful.
Business Scenarios at ByteDance
Typical use cases include search recall (large models with long load times), push services (per‑shard user targeting requiring unique IDs), and storage services such as custom KV stores, Druid, and Elasticsearch, which combine data locality and replica relationships.
Benefits of Cloud‑Native Migration
Efficiency gains come from standardized infrastructure APIs, abstracted business frameworks, automated processes, and unified delivery via containers. Cost reductions stem from faster container start‑up and on‑demand resource allocation.
Challenges and Solutions
Key challenges involve state management, enhanced base capabilities, and automated operations. ByteDance built a custom SolarService that extends StatefulSet with a Budset CRD to manage data versioning and side‑car data sync.
State Management
Three aspects are addressed: version management (similar to Deployment/StatefulSet upgrades), data management (updating external data without changing replicas), and service discovery & routing (directing requests to the correct shard). The solution includes a matrix of Pods per shard, a custom controller for rolling upgrades, and a proxy layer for routing based on shard and replica health.
Rolling Upgrade Example
Shards are upgraded in parallel. Within each shard, the number of replicas taken down at any one time is capped by a configurable MaxUnavailable value.
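The per-shard limit can be illustrated with a small planner. This is only a sketch: the function name, the pod naming, and the round-based batching are assumptions made for illustration, not the actual SolarService controller logic.

```python
from typing import Dict, List


def plan_upgrade_rounds(shards: Dict[str, List[str]], max_unavailable: int) -> List[List[str]]:
    """Group pods into upgrade rounds (hypothetical helper).

    All shards progress in parallel, but each round takes at most
    `max_unavailable` pods from any single shard.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")

    rounds: List[List[str]] = []
    round_idx = 0
    while True:
        batch: List[str] = []
        start = round_idx * max_unavailable
        for shard_id in sorted(shards):
            # Take the next slice of replicas from this shard, if any remain.
            batch.extend(shards[shard_id][start:start + max_unavailable])
        if not batch:
            break
        rounds.append(batch)
        round_idx += 1
    return rounds


if __name__ == "__main__":
    pods = {
        "shard-0": ["s0-r0", "s0-r1", "s0-r2"],
        "shard-1": ["s1-r0", "s1-r1"],
    }
    for i, batch in enumerate(plan_upgrade_rounds(pods, max_unavailable=1)):
        print(f"round {i}: {batch}")
```

A real controller would also wait for each upgraded replica to become ready before starting the next round. That step is omitted here.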
Scaling
Scaling a shard’s replica count follows standard StatefulSet scaling. Scaling data shards involves a multi‑step process that doubles the number of shards, updates Budsets, and gradually shifts traffic to the new shards.
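One reason doubling is a convenient step is that, under modulo hash partitioning, every key in old shard s lands in either shard s or shard s + N once N shards become 2N. The sketch below assumes modulo partitioning over a CRC32 hash and a percentage-based traffic switch. Neither is confirmed by the talk, and all function names are hypothetical.

```python
import zlib
from typing import Tuple


def shard_of(key: str, num_shards: int) -> int:
    """Assumed partitioning scheme: CRC32 hash modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def split_targets(old_shard: int, old_count: int) -> Tuple[int, int]:
    """After doubling, keys of `old_shard` can only land in these two shards."""
    return old_shard, old_shard + old_count


def route(key: str, old_count: int, percent_on_new: int) -> Tuple[str, int]:
    """Gradually shift traffic: send `percent_on_new` percent of keys to the new layout."""
    h = zlib.crc32(key.encode("utf-8"))
    new_count = 2 * old_count
    # Use hash bits not consumed by the shard choice to pick a stable bucket in 0..99.
    bucket = (h // new_count) % 100
    if bucket < percent_on_new:
        return "new", h % new_count
    return "old", h % old_count


if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        print(key, shard_of(key, 4), shard_of(key, 8), route(key, 4, 50))
```

Because the bucket is derived from the key, a given key switches layouts exactly once as the percentage is raised from 0 to 100.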
Service Discovery & Routing
A custom Service Discovery component registers pod IPs, ports, shard IDs, and replica IDs in a KV store, enabling fine‑grained routing and circuit‑breaking without relying on native K8s Service routing.
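A minimal sketch of such a registry follows, with an in-memory dict standing in for the KV store. The key layout, the class and method names, and the consecutive-failure threshold used for circuit-breaking are all assumptions for illustration; the talk does not describe the real component at this level of detail.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Endpoint:
    ip: str
    port: int
    shard_id: int
    replica_id: int
    failures: int = 0  # consecutive failed calls, used for circuit-breaking


class Registry:
    """Hypothetical shard-aware registry backed by a dict instead of a real KV store."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self._kv: Dict[str, Endpoint] = {}
        self._threshold = failure_threshold

    @staticmethod
    def _key(service: str, shard_id: int, replica_id: int) -> str:
        return f"/{service}/shard/{shard_id}/replica/{replica_id}"

    def register(self, service: str, ep: Endpoint) -> None:
        self._kv[self._key(service, ep.shard_id, ep.replica_id)] = ep

    def report(self, service: str, shard_id: int, replica_id: int, ok: bool) -> None:
        """Record a call result; a success resets the failure counter."""
        ep = self._kv[self._key(service, shard_id, replica_id)]
        ep.failures = 0 if ok else ep.failures + 1

    def healthy(self, service: str, shard_id: int) -> List[Endpoint]:
        """Return replicas of one shard whose circuit is still closed."""
        prefix = f"/{service}/shard/{shard_id}/replica/"
        return [
            ep for key, ep in sorted(self._kv.items())
            if key.startswith(prefix) and ep.failures < self._threshold
        ]


if __name__ == "__main__":
    reg = Registry()
    reg.register("search", Endpoint("10.0.0.1", 8080, shard_id=0, replica_id=0))
    reg.register("search", Endpoint("10.0.0.2", 8080, shard_id=0, replica_id=1))
    for _ in range(3):
        reg.report("search", 0, 0, ok=False)
    print([ep.ip for ep in reg.healthy("search", 0)])
```

A proxy would call `healthy` with the shard derived from the request and pick one of the returned replicas, so routing never depends on a native Kubernetes Service.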
Base Capability Enhancements
Two main areas: scheduling and storage. Scheduling leverages NUMA‑aware enhancements to the K8s scheduler and Kubelet, exposing micro‑topology resources via CRDs and custom predicates/priorities, and assigning CPU sets and NUMA nodes to pods.
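The core idea of placing a pod's CPUs within a single NUMA node can be sketched as follows. The best-fit policy and the function name are assumptions chosen for illustration; the talk does not say which placement policy the enhanced scheduler and Kubelet actually use.

```python
from typing import Dict, List, Optional, Tuple


def pick_numa_node(
    free_cpus: Dict[int, List[int]], cpus_needed: int
) -> Optional[Tuple[int, List[int]]]:
    """Choose a NUMA node and a CPU set for a pod (hypothetical best-fit policy).

    `free_cpus` maps NUMA node id -> ids of CPUs still unallocated on that node.
    Returns None when no single node can hold the whole request.
    """
    candidates = [
        (len(cpus), node)
        for node, cpus in free_cpus.items()
        if len(cpus) >= cpus_needed
    ]
    if not candidates:
        return None
    # Best fit: the node with the fewest free CPUs that still satisfies the request,
    # which keeps larger nodes available for larger pods.
    _, node = min(candidates)
    cpuset = sorted(free_cpus[node])[:cpus_needed]
    return node, cpuset


if __name__ == "__main__":
    topology = {0: [0, 1, 2, 3, 4, 5], 1: [8, 9, 10]}
    print(pick_numa_node(topology, 2))
    print(pick_numa_node(topology, 4))
    print(pick_numa_node(topology, 8))
```

Keeping all of a pod's CPUs on one node avoids cross-node memory access, which is the usual motivation for NUMA-aware placement.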
Storage enhancements include dynamic provisioning for multiple media, remote block storage via NBD (single‑write‑single‑read and multi‑read modes), and local disk solutions (tmpfs, LVM, full‑disk allocation, and Intel AEP). The system also implements Volume Scheduling with custom predicates, assume‑volume annotations, and bind phases.
Monitoring & Automated Operations
ByteDance developed an eBPF‑based container‑level monitoring component SysProbe that collects over 100 metrics, aggregated by a high‑availability Metrics Aggregation Server (MAS) and exported to downstream sinks.
For automation, a custom extension of the PodDisruptionBudget (PDB), implemented via a webhook, adds eviction strategies that take multi‑AZ pod distribution into account, so that pods can be removed safely during host maintenance.
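The kind of decision such a webhook might make can be sketched with a pure function. The specific rule below (stay within a MaxUnavailable budget and never have pods down in more than one availability zone at once) is an assumption for illustration, as are the type and function names; the talk does not spell out the actual strategies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str
    zone: str
    ready: bool


def may_evict(target: Pod, pods: List[Pod], max_unavailable: int) -> bool:
    """Decide whether evicting `target` is safe (hypothetical AZ-aware rule)."""
    unavailable = [p for p in pods if not p.ready]
    if not target.ready:
        # Evicting a pod that is already down does not reduce availability further.
        return True
    if len(unavailable) + 1 > max_unavailable:
        return False
    # Assumed rule: only disrupt one availability zone at a time.
    return all(p.zone == target.zone for p in unavailable)


if __name__ == "__main__":
    pods = [
        Pod("a", "az1", True),
        Pod("b", "az1", False),
        Pod("c", "az2", True),
        Pod("d", "az2", True),
    ]
    print(may_evict(pods[0], pods, max_unavailable=2))
    print(may_evict(pods[2], pods, max_unavailable=2))
```

In a real deployment this check would run inside an admission webhook that intercepts eviction requests. That wiring is omitted here.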
CSI Race Conditions and Mitigations
Issues such as residual global mounts, duplicate volume opens, and race conditions during pod deletion were addressed by adding residual mount scans in the Kubelet Volume Manager and enhancing CSI drivers to handle unstage failures.
Case Study
The case study explores lightweight virtualization with Kata Containers, so that failures are contained at the pod level rather than affecting the whole host.
Conclusion
Stateful applications in ByteDance’s cloud‑native journey exhibit local data dependency, persistence, and unique instance identification. Cloud‑native transformation yields efficiency and cost benefits through improved state management, extreme performance via NUMA‑aware scheduling, enriched storage capabilities, and automated operations with custom PDB extensions.
