Design and Implementation of Cloud‑Native High‑Availability Solutions for Data Components at eBay
eBay’s data infrastructure engineers describe how they design and implement cloud‑native, multi‑cluster high‑availability architectures for stateful data components—covering background challenges, federated Kubernetes management, state handling, fault‑tolerance, backup, and chaos testing—to ensure reliable, scalable data services across global data centers.
Moving data components to cloud-native platforms is a key trend for big‑data open‑source products, and cloud providers continue to offer data services as managed platforms. In deployments spanning multiple availability zones and multiple Kubernetes clusters, designing high‑availability solutions for data components becomes a central challenge.
01 Background and Challenges
Data components refer to the data‑related products in the big‑data ecosystem, both open‑source and internal. Because they sit at critical points in the data lifecycle, moving them to cloud‑native platforms is difficult. Kubernetes is inherently suited to stateless workloads: when a pod fails, a replacement is created automatically. Stateful data components, however, carry persistent state that lives outside any single pod and must survive pod replacement, so they require careful high‑availability design.
Stateful services must consider distributed consistency, data persistence, I/O performance, fault tolerance, and backup/recovery. eBay’s cloud‑native journey began with these hardest problems, deploying the first data services on Kubernetes and extracting reusable, platform‑level patterns.
A single Kubernetes cluster can support only a few thousand nodes, so scaling out to many clusters is inevitable for resource demand, high availability, and multi‑region resilience. eBay uses a "federated cluster" model: independent ApiServer and etcd instances manage global objects and synchronize them to real clusters distributed across global data centers.
Key challenges in this multi‑cluster setup include optimal fault‑tolerant placement of data components, cross‑cluster service discovery, handling endpoint failures, global resource optimization, and seamless maintenance without service disruption.
02 Cloud‑Native Architecture
Against this background, eBay designed a cloud‑native architecture for data services in a multi‑cluster environment. Open‑source data products often lack native cloud support, so eBay builds automation layers on top of them that provide fault tolerance, manageability, and observability.
Using Zookeeper as a simple example, a 2+2+1 deployment across three availability zones ensures that a single zone failure still leaves a majority of replicas running. For more complex stateful services like Kafka, eBay developed a FederatedStatefulSet that offers a declarative API similar to StatefulSet but operates across clusters.
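The arithmetic behind the 2+2+1 layout can be checked directly. The sketch below (zone names and counts are illustrative, not eBay's actual configuration) verifies that losing any single zone still leaves a majority of the five replicas:

```python
# Sketch: verify that a 2+2+1 ZooKeeper placement across three
# availability zones keeps quorum under any single-zone failure.
# Zone names and replica counts are illustrative placeholders.

def survives_zone_loss(placement: dict[str, int]) -> bool:
    """Return True if losing any one zone still leaves a majority of replicas."""
    total = sum(placement.values())
    quorum = total // 2 + 1
    return all(total - lost >= quorum for lost in placement.values())

placement = {"zone-a": 2, "zone-b": 2, "zone-c": 1}  # the 2+2+1 layout
print(survives_zone_loss(placement))                 # True: any zone loss leaves >= 3 of 5
print(survives_zone_loss({"zone-a": 3, "zone-b": 2}))  # False: losing zone-a leaves 2 of 5
```

The same check generalizes to any replica count: a placement survives a zone failure exactly when no single zone holds half or more of the replicas.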
The FederatedStatefulSet adds capabilities such as in‑place upgrades, custom gray‑release (canary) strategies, configuration management, and seamless cross‑cluster hot migration, extending beyond the native StatefulSet's limited lifecycle hooks.
For Kafka, data partitions are spread across nodes; high‑availability requires placing replicas in different zones while keeping the entire cluster within a single data center to meet low‑latency I/O requirements. This demands two‑layer scheduling: container‑level anti‑affinity and data‑level anti‑affinity provided by the application.
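The data‑level layer of this two‑tier scheduling can be sketched as a greedy placer that assigns each partition's replicas to brokers in distinct zones. Broker and zone names below are illustrative, not eBay's actual topology:

```python
# Sketch of the data-level anti-affinity described above: given brokers
# already spread across zones (container level), pick replicas for one
# partition so that no two replicas share a zone (data level).
# Broker/zone names are hypothetical placeholders.

def place_replicas(brokers: dict[str, str], replication: int) -> list[str]:
    """Pick `replication` brokers, each in a different availability zone."""
    chosen: list[str] = []
    used_zones: set[str] = set()
    for broker, zone in brokers.items():
        if zone not in used_zones:
            chosen.append(broker)
            used_zones.add(zone)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct zones for the requested replication")

brokers = {"kafka-0": "az1", "kafka-1": "az2", "kafka-2": "az3", "kafka-3": "az1"}
print(place_replicas(brokers, 3))  # ['kafka-0', 'kafka-1', 'kafka-2']
```

A real scheduler would also balance partition counts per broker; the point here is only the anti‑affinity constraint itself.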
Because many data services consist of multiple inter‑dependent components (e.g., Kafka + Zookeeper, Elasticsearch + Kibana), eBay built a cloud‑native data component management platform that offers unified deployment, automated operations, cross‑cluster scheduling, chaos testing, and RBAC‑based multi‑tenant isolation.
03 State Management
State is crucial for data components. Both proactive state transitions (e.g., upgrades) and reactive ones (e.g., failures) must keep services available. eBay provides two patterns:
Controller mode : a workflow‑driven controller receives user requests and safely transitions the cluster from its current to the desired state, handling failures and offering pluggable lifecycle hooks.
SideCar mode : a SideCar agent forms a Raft cluster to monitor the main process; on failure it performs automatic leader election and failover for applications lacking native HA.
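The Controller pattern above boils down to a reconcile loop that moves observed state toward desired state one safe step at a time. A minimal sketch, with a hypothetical state model (replica counts) and hook signature standing in for the real workflow engine:

```python
# Minimal sketch of the Controller pattern: a reconcile loop that moves
# the observed state toward the desired state one safe step at a time,
# firing an optional lifecycle hook before each transition.
# The integer state model and hook signature are simplifying assumptions.

def reconcile(observed: int, desired: int, pre_hook=None) -> int:
    """Advance the replica count by one step toward `desired`."""
    if observed == desired:
        return observed  # already converged; nothing to do
    if pre_hook:
        pre_hook(observed, desired)  # pluggable safety check / lifecycle hook
    return observed + (1 if desired > observed else -1)

state = 1
while state != 3:          # controller drives the cluster from 1 to 3 replicas
    state = reconcile(state, 3)
print(state)  # 3
```

Stepping one replica at a time is what makes failures recoverable: the controller can re‑run the loop from any intermediate state.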
Additionally, Kubernetes Pod Disruption Budgets (PDB) protect against accidental node removal. eBay extends this to an Application Disruption Budget (ADB) that validates application health before proceeding with node‑drain operations, allowing fine‑grained control such as rack‑aware draining.
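The ADB check described above can be sketched as a gate evaluated before a node drain proceeds. The field names are illustrative assumptions, not eBay's actual API:

```python
# Sketch of an Application Disruption Budget check: before draining a
# node, verify every affected application would stay above its minimum
# healthy replica count. Field names are illustrative, not eBay's API.
from dataclasses import dataclass

@dataclass
class AppBudget:
    healthy: int       # currently healthy replicas
    min_healthy: int   # floor the application must maintain
    on_node: int       # replicas hosted on the node being drained

def can_drain(budgets: list[AppBudget]) -> bool:
    """Allow the drain only if no application drops below its floor."""
    return all(b.healthy - b.on_node >= b.min_healthy for b in budgets)

apps = [AppBudget(healthy=5, min_healthy=3, on_node=1),
        AppBudget(healthy=3, min_healthy=3, on_node=1)]
print(can_drain(apps))  # False: the second app would drop below its floor
```

Unlike a plain PDB, which counts pods, this gate consults application‑level health, which is what enables policies such as rack‑aware draining.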
04 Fault Tolerance and Recovery
In a multi‑cluster, multi‑AZ setup, high‑availability is achieved by deploying active‑passive clusters across zones and performing failover when needed. Reliable failure detection relies on robust monitoring systems.
After a failure, data must be synchronized with verification mechanisms to ensure correctness. Backup and restore are also critical; eBay runs backup jobs inside the SideCar agent, records task status, and supports configurable backup policies.
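A configurable backup policy of the kind described can be reduced to a retention rule. The sketch below (policy fields are assumptions for illustration) keeps the newest snapshots within an age window:

```python
# Sketch of a configurable backup retention policy, as run by the
# SideCar agent described above: keep the newest `keep_last` snapshots
# that are younger than `max_age`. Policy fields are assumptions.
from datetime import datetime, timedelta

def prune(snapshots: list[datetime], keep_last: int,
          max_age: timedelta, now: datetime) -> list[datetime]:
    """Return the snapshots to retain, newest first."""
    fresh = sorted((s for s in snapshots if now - s <= max_age), reverse=True)
    return fresh[:keep_last]

now = datetime(2024, 1, 10)
snaps = [now - timedelta(days=d) for d in (0, 1, 2, 8)]
kept = prune(snaps, keep_last=3, max_age=timedelta(days=7), now=now)
print(len(kept))  # 3: the 8-day-old snapshot ages out
```

Recording task status alongside each snapshot (as the SideCar does) lets the restore path distinguish complete backups from interrupted ones.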
To validate HA designs, eBay incorporates chaos testing. Using open‑source chaos tools (e.g., Litmus), they simulate container, network, and disk failures at various scopes—data center, cluster, rack, or pod—while enforcing strict permission controls and generating detailed test reports.
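Scoped fault injection of this kind needs a way to resolve a blast radius (data center, rack, or pod) to concrete victims before the chaos tool runs. A minimal sketch, with a hypothetical topology map standing in for real cluster inventory:

```python
# Sketch of blast-radius resolution for scoped chaos experiments:
# map a scope/target pair to the concrete pods to disrupt.
# The topology map and names are illustrative assumptions.

TOPOLOGY = {
    "dc1": {"rack1": ["pod-a", "pod-b"], "rack2": ["pod-c"]},
    "dc2": {"rack3": ["pod-d"]},
}

def victims(scope: str, target: str) -> list[str]:
    """Resolve a scope/target pair to the pods a chaos experiment will hit."""
    if scope == "datacenter":
        return [p for rack in TOPOLOGY[target].values() for p in rack]
    if scope == "rack":
        for racks in TOPOLOGY.values():
            if target in racks:
                return racks[target]
        raise ValueError(f"unknown rack {target!r}")
    if scope == "pod":
        return [target]
    raise ValueError(f"unknown scope {scope!r}")

print(victims("rack", "rack1"))  # ['pod-a', 'pod-b']
```

Permission controls then reduce to restricting which scope/target pairs a tenant may pass to this resolver.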
05 Summary
The cloud‑native transformation of big‑data products is ongoing, and reliable data services remain a critical driver for enterprises. This article presented eBay’s end‑to‑end solution for managing data components across multiple Kubernetes clusters, covering architecture, state management, fault tolerance, backup, and chaos testing. Future work will continue to evolve the platform toward smarter, more efficient data services.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.