eBay's Cloud‑Native Kafka Big Data Platform: Disaster Recovery and High‑Availability Practices
This article details eBay's implementation of a cloud‑native Kafka platform on Kubernetes, covering operational challenges, K8s Operator deployment, single‑ and multi‑data‑center high‑availability designs, anti‑affinity strategies, automated failover components, and future work on remote storage for Kafka.
Introduction
This session covers eBay's cloud‑native Kafka big‑data platform and its disaster‑recovery and high‑availability practices.
Agenda
eBay Kafka K8s deployment practice
eBay Kafka single‑data‑center HA practice
eBay Kafka cross‑data‑center HA solution
Future work
1. Kafka Cluster Operational Challenges
Deploying large‑scale Kafka clusters in the cloud faces two main issues: cumbersome cluster setup (Zookeeper dependency, inconsistent broker configurations) and time‑consuming upgrades that require careful monitoring of partition ISR status.
To address these, a one‑click, automated solution is needed.
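The ISR monitoring that makes manual upgrades slow can be sketched as a simple gate. The following is a minimal illustration, not eBay's implementation: `PartitionState`, `under_replicated`, `safe_to_restart`, and the `min_isr` default are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Replica set and in-sync replica (ISR) set of one partition."""
    topic: str
    partition: int
    replicas: frozenset
    isr: frozenset


def under_replicated(partitions):
    """Return the partitions whose ISR is smaller than the replica set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def safe_to_restart(broker_id, partitions, min_isr=2):
    """Check whether a broker can be restarted without dropping any
    partition it hosts below min_isr in-sync replicas (assumed threshold)."""
    for p in partitions:
        if broker_id not in p.replicas:
            continue
        remaining = p.isr - {broker_id}
        if len(remaining) < min_isr:
            return False
    return True
```

An automated rolling upgrade would run a check like this before each broker restart and wait while it returns False.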
2. Kafka K8s Deployment Solution
The solution uses a Kubernetes Operator to manage Kafka clusters. The Operator extends the K8s API, encapsulating operational expertise and automating tasks such as broker creation, configuration (broker.id, advertised.listeners), rolling updates, scaling, and self‑healing when pods fail.
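One common way to derive per‑broker settings automatically is to read them off the pod name. The sketch below assumes brokers run as StatefulSet‑style pods named `<prefix>-<ordinal>` behind a headless service; the talk does not state that eBay does this, and `broker_settings` and its parameters are hypothetical.

```python
import re


def broker_settings(pod_name, headless_service, namespace, port=9092):
    """Derive broker.id and advertised.listeners from a pod name.

    Assumes pod names end in "-<ordinal>" and that each pod is reachable
    through a stable per-pod DNS name (both are assumptions of this sketch).
    """
    match = re.fullmatch(r"(.+)-(\d+)", pod_name)
    if match is None:
        raise ValueError(f"pod name has no ordinal suffix: {pod_name!r}")
    ordinal = int(match.group(2))
    host = f"{pod_name}.{headless_service}.{namespace}.svc.cluster.local"
    return {
        "broker.id": str(ordinal),
        "advertised.listeners": f"PLAINTEXT://{host}:{port}",
    }
```

Because the values are a pure function of the pod identity, a recreated pod comes back with the same broker.id and address.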
Custom Resources (CR) hold Kafka configuration (CPU, RAM, replicas, etc.). The Operator watches CR events and performs corresponding actions across multiple K8s clusters via a Federation layer that makes resources visible to all clusters.
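The watch‑and‑act behaviour is usually structured as a reconcile step that compares the desired state in the CR with what is actually running. The sketch below is illustrative only: `KafkaClusterSpec`, `reconcile`, and the action tuples are hypothetical and do not reflect the real CR schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaClusterSpec:
    """Desired state, as it might be read from a Custom Resource."""
    replicas: int
    cpu: str
    memory: str


def reconcile(spec, observed):
    """Compare desired and observed brokers and return the actions needed.

    observed maps a broker ordinal to its current {"cpu", "memory"} settings.
    Missing brokers are created (this also covers self-healing), drifted
    brokers are updated, and surplus brokers are deleted.
    """
    wanted = {"cpu": spec.cpu, "memory": spec.memory}
    actions = []
    for ordinal in range(spec.replicas):
        current = observed.get(ordinal)
        if current is None:
            actions.append(("create", ordinal))
        elif current != wanted:
            actions.append(("update", ordinal))
    for ordinal in sorted(observed):
        if ordinal >= spec.replicas:
            actions.append(("delete", ordinal))
    return actions
```

Running the same step again once the actions have been applied yields an empty list, which is what makes the loop safe to trigger on every CR event.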
3. eBay Kafka Platform Architecture
The architecture includes a Federation layer that acts as a global API server across all K8s clusters. Operators deployed in this layer listen to a single set of Custom Resources, enabling cross‑cluster management.
4. Single‑Data‑Center HA Practice
eBay runs Kafka at massive scale (2500+ msgs/s, >8 PB disk, 7,000+ brokers, 700k partitions). Faults are inevitable, and infrastructure maintenance adds complexity.
4.1 Pod Failure
The Operator’s self‑healing automatically recreates a broker pod, relying on Kafka’s replication to maintain data availability.
4.2 Node Failure
Anti‑affinity rules are added to broker pod specs so that no two brokers of the same cluster are scheduled on the same node. A node failure then takes down at most one broker, and therefore at most one replica of any partition.
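A node‑level rule of this kind could look like the fragment below, built here as a Python dict that mirrors the Kubernetes pod spec fields. The `app` label key and the `broker_anti_affinity` helper are assumptions for illustration; `kubernetes.io/hostname` is the standard node topology key.

```python
def broker_anti_affinity(cluster_label, topology_key="kubernetes.io/hostname"):
    """Build a pod spec fragment that keeps brokers of one cluster apart.

    The scheduler refuses to place a pod in a topology domain (by default
    a node) that already runs a pod carrying the same label.
    """
    term = {
        "labelSelector": {"matchLabels": {"app": cluster_label}},
        "topologyKey": topology_key,
    }
    return {
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [term],
            }
        }
    }
```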
4.3 Rack Failure
Similar to node failure, rack‑level anti‑affinity is used together with Kafka's broker.rack setting, so that brokers, and with them the replicas of each partition, are spread across different racks. When rack resources are insufficient, the anti‑affinity is set to “preferred” so that scheduling can still succeed, and a minimum rack count is enforced in the CR.
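The trade‑off between strict and tolerant placement can be made concrete with a small check. This is a sketch under stated assumptions: `anti_affinity_mode` and `check_rack_spread` are hypothetical helpers, and the rule for choosing the mode is one plausible policy, not the one eBay documents.

```python
from collections import Counter


def anti_affinity_mode(num_brokers, num_racks):
    """Pick "required" only when every broker can get a rack of its own;
    otherwise fall back to "preferred" so that pods can still be scheduled."""
    return "required" if num_racks >= num_brokers else "preferred"


def check_rack_spread(broker_racks, min_racks):
    """Summarise how brokers are spread over racks.

    broker_racks maps a broker id to the rack it reports (the value that
    would be written to broker.rack).
    """
    counts = Counter(broker_racks.values())
    return {
        "racks_used": len(counts),
        "meets_minimum": len(counts) >= min_racks,
        "max_brokers_per_rack": max(counts.values(), default=0),
    }
```

The `max_brokers_per_rack` figure is the number of brokers lost when the fullest rack fails, which is why a minimum rack count matters once placement is only preferred.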
4.4 Infrastructure Maintenance
During node upgrades, brokers become temporarily unavailable. eBay uses Kubernetes PodDisruptionBudgets (PDB) and a custom HealthMonitor resource to block voluntary pod evictions when the cluster is unhealthy.
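How the HealthMonitor resource feeds into eviction decisions is not described in the talk. A minimal sketch of the combined gate, assuming a PDB with a `minAvailable` value and a boolean health signal (both function names are hypothetical), might be:

```python
def allowed_disruptions(healthy_brokers, min_available):
    """Mirror the PDB rule: pods above minAvailable may be disrupted."""
    return max(0, healthy_brokers - min_available)


def may_evict(healthy_brokers, min_available, cluster_healthy):
    """Allow a voluntary eviction only if the cluster reports healthy
    and the disruption budget still has room."""
    if not cluster_healthy:
        return False
    return allowed_disruptions(healthy_brokers, min_available) > 0
```

With five healthy brokers and a minimum of four, one broker may be drained at a time, and none while the cluster is flagged unhealthy.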
5. Cross‑Data‑Center HA Practice
Two disaster‑recovery architectures are used: Active‑Active (mirrored topics across data centers) and Local‑Aggregation (an aggregation layer before replication). Active‑Active offers lower latency but higher complexity; Local‑Aggregation simplifies failover.
5.1 Automatic Failover
Once a backup cluster is in place, clients also need to fail over automatically. eBay provides three components: Kafka Healthiness Status Service, HA Producer/Consumer, and Offset Management Service (based on MirrorMaker 2 with custom task logic).
5.1.1 Cluster Health Detection
A unified health‑checking service (Kafka Healthiness Status Service) monitors the whole cluster and provides manual switch‑over capability.
5.1.2 Client Automatic Cluster Switching
When an unhealthy cluster is detected, the HA Producer/Consumer automatically redirects traffic to a healthy cluster.
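The switching logic can be sketched as a thin wrapper that consults the health status before each send. Everything here is hypothetical: `choose_cluster`, `HAProducerSketch`, the `"healthy"` status string, and the injected `get_health` and `send_fn` callables stand in for the real service and Kafka client, whose interfaces the talk does not describe.

```python
def choose_cluster(preferred_order, health):
    """Return the first cluster in preference order that is healthy."""
    for name in preferred_order:
        if health.get(name) == "healthy":
            return name
    raise RuntimeError("no healthy cluster available")


class HAProducerSketch:
    """Route each record to the currently preferred healthy cluster."""

    def __init__(self, clusters, get_health, send_fn):
        self.clusters = list(clusters)   # preference order, e.g. local first
        self.get_health = get_health     # returns {cluster: status}
        self.send_fn = send_fn           # send_fn(cluster, topic, value)

    def send(self, topic, value):
        target = choose_cluster(self.clusters, self.get_health())
        self.send_fn(target, topic, value)
        return target
```

A real client would cache the health status and reuse connections rather than look it up per record; the sketch only shows the routing decision.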
5.1.3 Offset Synchronization Across Clusters
The Offset Management Service synchronizes consumer group offsets between clusters, built on a customized MirrorMaker 2.
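Offsets cannot be copied verbatim between clusters because the same record generally sits at a different offset in the mirror. The sketch below shows a deliberately conservative translation over recorded (source offset, target offset) pairs; it is a simplification for illustration, not MirrorMaker 2's actual algorithm or eBay's custom task logic, and `translate_offset` is a hypothetical name.

```python
import bisect


def translate_offset(sync_points, source_offset):
    """Map a committed source-cluster offset to a target-cluster offset.

    sync_points is a list of (source_offset, target_offset) pairs sorted by
    source offset. The latest pair at or before the committed offset is
    used, so a consumer may re-read some records after failover but never
    skips any. Returns None when no pair applies yet.
    """
    sources = [source for source, _ in sync_points]
    index = bisect.bisect_right(sources, source_offset) - 1
    if index < 0:
        return None
    return sync_points[index][1]
```

Rounding down to a known pair trades duplicate processing for safety, which suits consumers that are idempotent.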
6. Future Work
eBay is evaluating remote storage to replace local disks for Kafka data.
Remote storage may increase cost, but tiered storage, a Kafka feature that was still under development at the time of the talk, can mitigate this.
Thank you for attending the session.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
