
eBay's Cloud‑Native Kafka Big Data Platform: Disaster Recovery and High‑Availability Practices

This article details eBay's implementation of a cloud‑native Kafka platform on Kubernetes, covering operational challenges, K8s Operator deployment, single‑ and multi‑data‑center high‑availability designs, anti‑affinity strategies, automated failover components, and future work on remote storage for Kafka.

DataFunSummit

Sep 11, 2023

Introduction

This session covers eBay's cloud‑native Kafka big‑data platform and the disaster‑recovery and high‑availability practices built around it.

Agenda

eBay Kafka K8s deployment practice

eBay Kafka single‑data‑center HA practice

eBay Kafka cross‑data‑center HA solution

Future work

1. Kafka Cluster Operational Challenges

Deploying large‑scale Kafka clusters in the cloud raises two main problems. Cluster setup is cumbersome: it depends on ZooKeeper, and broker configurations drift out of sync. Upgrades are time‑consuming because each step requires careful monitoring of partition ISR status.

To address these, a one‑click, automated solution is needed.
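The ISR monitoring that makes manual upgrades slow can be expressed as a gate inside a rolling restart loop. The following is a minimal Python sketch, not eBay's tooling: `restart_broker` and `under_replicated_partitions` are hypothetical callbacks standing in for whatever restarts a broker and reads the cluster's under‑replicated partition count.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    broker_ids: Iterable[int],
    restart_broker: Callable[[int], None],           # hypothetical: restarts one broker
    under_replicated_partitions: Callable[[], int],  # hypothetical: cluster-wide count
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Restart brokers one at a time, waiting for ISRs to recover in between.

    Returns the brokers that were restarted. Raises TimeoutError if the
    under-replicated partition count does not reach zero within max_polls.
    """
    restarted: List[int] = []
    for broker_id in broker_ids:
        restart_broker(broker_id)
        restarted.append(broker_id)
        for _ in range(max_polls):
            if under_replicated_partitions() == 0:
                break  # every partition is fully replicated again
            sleep(poll_seconds)
        else:
            raise TimeoutError(
                f"ISR did not recover after restarting broker {broker_id}"
            )
    return restarted
```

Restarting only one broker at a time and refusing to continue while any partition is under‑replicated is the manual discipline that an automated solution has to encode.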

2. Kafka K8s Deployment Solution

The solution uses a Kubernetes Operator to manage Kafka clusters. The Operator extends the K8s API, encapsulating operational expertise and automating tasks such as broker creation, configuration (broker.id, advertised.listeners), rolling updates, scaling, and self‑healing when pods fail.
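One way an Operator might derive per‑broker settings such as broker.id and advertised.listeners is from the pod's own name. The sketch below assumes StatefulSet‑style pod names ending in an ordinal (for example `kafka-3`) and a DNS domain passed in by the caller; the naming scheme, the `broker_settings` helper, and the PLAINTEXT listener are illustrative assumptions, not details from the talk.

```python
def broker_settings(pod_name: str, service_domain: str, port: int = 9092) -> dict:
    """Derive broker.id and advertised.listeners from a pod name like 'kafka-3'."""
    prefix, _, ordinal = pod_name.rpartition("-")
    if not prefix or not ordinal.isdigit():
        raise ValueError(f"pod name {pod_name!r} does not end in an ordinal")
    return {
        # The ordinal is stable across pod restarts, so the broker keeps its id.
        "broker.id": int(ordinal),
        # Each broker advertises its own stable DNS name to clients.
        "advertised.listeners": f"PLAINTEXT://{pod_name}.{service_domain}:{port}",
    }


settings = broker_settings("kafka-3", "kafka-headless.demo.svc.cluster.local")
```

Deriving both values from one stable identifier avoids the inconsistent hand‑written broker configurations described earlier.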

Custom Resources (CR) hold Kafka configuration (CPU, RAM, replicas, etc.). The Operator watches CR events and performs corresponding actions across multiple K8s clusters via a Federation layer that makes resources visible to all clusters.
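The watch‑and‑act behavior can be pictured as a reconcile step that compares the desired state in the CR with what is actually running. This is a simplified Python sketch under stated assumptions: `KafkaClusterSpec` is a hypothetical CR shape, pods are represented only by name, and the returned strings stand in for real API calls.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class KafkaClusterSpec:
    """Hypothetical Custom Resource fields (names are assumptions)."""
    name: str
    replicas: int
    cpu: str
    memory: str


def reconcile(spec: KafkaClusterSpec, running_pods: Set[str]) -> List[str]:
    """Return the actions needed to move the cluster toward the desired state."""
    desired = {f"{spec.name}-{i}" for i in range(spec.replicas)}
    actions = [f"create {pod}" for pod in sorted(desired - running_pods)]
    actions += [f"delete {pod}" for pod in sorted(running_pods - desired)]
    return actions


spec = KafkaClusterSpec(name="kafka", replicas=3, cpu="8", memory="32Gi")
actions = reconcile(spec, {"kafka-0", "kafka-2", "kafka-3"})
# actions == ["create kafka-1", "delete kafka-3"]
```

Because the function only looks at the difference between desired and observed state, the same pass handles scaling and self‑healing: a failed pod drops out of `running_pods`, and the next pass emits a create action for it.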

3. eBay Kafka Platform Architecture

The architecture includes a Federation layer that acts as a global API server across all K8s clusters. Operators deployed in this layer listen to a single set of Custom Resources, enabling cross‑cluster management.

4. Single‑Data‑Center HA Practice

eBay runs Kafka at massive scale (2500+ msgs/s, >8 PB disk, 7,000+ brokers, 700k partitions). Faults are inevitable, and infrastructure maintenance adds complexity.

4.1 Pod Failure

When a broker pod fails, the Operator's self‑healing logic recreates it automatically, and Kafka's replication keeps the data available while the pod is down.

4.2 Node Failure

Pod anti‑affinity rules in the broker pod spec keep two brokers of the same cluster off the same node. Because replicas of a partition live on different brokers, a single node failure then takes out at most one replica of any partition.
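A node‑level rule of this kind might look like the following, built here as a Python dictionary that mirrors the `affinity` section of a pod spec. The field names and the `kubernetes.io/hostname` topology key are standard Kubernetes; the `kafka-cluster` label key and the `node_anti_affinity` helper are assumptions made for illustration.

```python
def node_anti_affinity(cluster_label: str) -> dict:
    """Build a hard anti-affinity rule: no two brokers of a cluster on one node."""
    return {
        "podAntiAffinity": {
            # "required" makes this a hard scheduling constraint.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # "kafka-cluster" is a hypothetical label on broker pods.
                    "labelSelector": {
                        "matchLabels": {"kafka-cluster": cluster_label}
                    },
                    # One broker per distinct node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


affinity = node_anti_affinity("orders")
```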

4.3 Rack Failure

As with node failure, rack‑level anti‑affinity is combined with Kafka's broker.rack setting so that replicas are spread across different racks. When rack capacity is insufficient, the anti‑affinity is relaxed to “preferred” so that pods can still be scheduled, and the CR enforces a minimum rack count.
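The relaxed rack rule and the minimum rack count can be sketched together. In this Python sketch the `preferredDuringSchedulingIgnoredDuringExecution` structure is standard Kubernetes, while the rack node label `example.com/rack`, the `kafka-cluster` pod label, and both helper functions are hypothetical.

```python
from typing import Dict

RACK_TOPOLOGY_KEY = "example.com/rack"  # hypothetical node label naming the rack


def rack_anti_affinity(cluster_label: str, weight: int = 100) -> dict:
    """Build a soft rule: prefer, but do not require, one broker per rack."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # Kubernetes accepts weights from 1 to 100
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchLabels": {"kafka-cluster": cluster_label}
                        },
                        "topologyKey": RACK_TOPOLOGY_KEY,
                    },
                }
            ]
        }
    }


def meets_min_racks(broker_racks: Dict[str, str], min_racks: int) -> bool:
    """Check that brokers (name -> rack) span at least min_racks distinct racks."""
    return len(set(broker_racks.values())) >= min_racks


placement = {"kafka-0": "rack-a", "kafka-1": "rack-b", "kafka-2": "rack-a"}
ok = meets_min_racks(placement, min_racks=2)  # True: two distinct racks
```

The soft rule alone would let the scheduler pack every broker into one rack under pressure, which is why a separate minimum rack count is still needed as a floor.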

4.4 Infrastructure Maintenance

During node upgrades, brokers become temporarily unavailable. eBay uses Kubernetes PodDisruptionBudgets (PDB) and a custom HealthMonitor resource to block voluntary pod evictions when the cluster is unhealthy.
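The decision to allow or block a voluntary eviction can be sketched as a simple predicate. The talk does not describe the HealthMonitor resource in detail, so the `ClusterHealth` fields and the `may_evict` function below are assumptions; `max_unavailable` plays the role of the PodDisruptionBudget's `maxUnavailable` setting.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    """Hypothetical health summary that a monitor might publish."""
    total_brokers: int
    available_brokers: int
    under_replicated_partitions: int


def may_evict(health: ClusterHealth, max_unavailable: int = 1) -> bool:
    """Allow a voluntary eviction only if the cluster can absorb it."""
    # Block while any partition is still catching up on replication.
    if health.under_replicated_partitions > 0:
        return False
    # Block if evicting one more broker would exceed the disruption budget.
    currently_unavailable = health.total_brokers - health.available_brokers
    return currently_unavailable + 1 <= max_unavailable


healthy = ClusterHealth(total_brokers=6, available_brokers=6,
                        under_replicated_partitions=0)
degraded = ClusterHealth(total_brokers=6, available_brokers=6,
                         under_replicated_partitions=12)
```

A plain PDB only counts available pods; adding a Kafka‑level signal such as under‑replicated partitions is what lets the gate stay closed while a cluster is running but not yet fully replicated.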

5. Cross‑Data‑Center HA Practice

Two disaster‑recovery architectures are used: Active‑Active (mirrored topics across data centers) and Local‑Aggregation (an aggregation layer before replication). Active‑Active offers lower latency but higher complexity; Local‑Aggregation simplifies failover.

5.1 Automatic Failover

Once data is replicated to a backup cluster, clients still need to fail over automatically. eBay provides three components for this: the Kafka Healthiness Status Service, the HA Producer/Consumer, and the Offset Management Service (based on MirrorMaker 2 with custom task logic).

5.1.1 Cluster Health Detection

A unified health‑checking service (Kafka Healthiness Status Service) monitors the whole cluster and provides manual switch‑over capability.

5.1.2 Client Automatic Cluster Switching

When an unhealthy cluster is detected, the HA Producer/Consumer automatically redirects traffic to a healthy cluster.
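The interaction between health detection and client switching can be sketched as a small router that the client consults before sending or polling. This is a hypothetical illustration, not the HA Producer/Consumer API: `FailoverRouter` and the `is_healthy` callback (standing in for a query to the health service) are assumed names.

```python
from typing import Callable, List


class FailoverRouter:
    """Pick the cluster a client should use, switching when it turns unhealthy."""

    def __init__(self, clusters: List[str], is_healthy: Callable[[str], bool]):
        if not clusters:
            raise ValueError("at least one cluster is required")
        self._clusters = list(clusters)
        self._is_healthy = is_healthy  # hypothetical health-service lookup
        self.active = self._clusters[0]

    def pick(self) -> str:
        # Stay on the current cluster while it is healthy to avoid flapping.
        if self._is_healthy(self.active):
            return self.active
        # Otherwise switch to the first healthy alternative, in priority order.
        for cluster in self._clusters:
            if cluster != self.active and self._is_healthy(cluster):
                self.active = cluster
                return cluster
        raise RuntimeError("no healthy cluster available")


status = {"dc1": True, "dc2": True}
router = FailoverRouter(["dc1", "dc2"], lambda name: status[name])
first = router.pick()    # "dc1"
status["dc1"] = False
second = router.pick()   # "dc2"
```

The router in this sketch is deliberately sticky: once it has moved to another cluster it stays there until that cluster fails, so a briefly recovering primary does not cause repeated switches.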

5.1.3 Offset Synchronization Across Clusters

The Offset Management Service synchronizes consumer group offsets between clusters, built on a customized MirrorMaker 2.
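Offsets cannot simply be copied between clusters, because the same record generally has a different offset in the mirrored topic. The sketch below shows one conservative way to translate a committed offset using recorded (source offset, target offset) sync points. It is a simplified illustration of the idea, not MirrorMaker 2's code or eBay's custom task logic, and `translate_offset` is a hypothetical name.

```python
import bisect
from typing import List, Optional, Tuple


def translate_offset(
    syncs: List[Tuple[int, int]], source_offset: int
) -> Optional[int]:
    """Translate a source-cluster offset to a safe target-cluster offset.

    syncs holds (source_offset, target_offset) pairs sorted by source offset.
    The result never points past the record the consumer has reached, so a
    consumer that fails over may re-read records but will not skip any.
    Returns None when no sync point at or before source_offset exists.
    """
    source_points = [src for src, _ in syncs]
    i = bisect.bisect_right(source_points, source_offset) - 1
    if i < 0:
        return None
    sync_source, sync_target = syncs[i]
    if source_offset == sync_source:
        return sync_target
    # Past the sync point: only the record at the sync point is known to be
    # mirrored at sync_target, so resume just after it.
    return sync_target + 1


syncs = [(100, 40), (200, 90)]
translated = translate_offset(syncs, 150)  # 41
```

Erring on the side of re‑reading gives at‑least‑once behavior after a switch; denser sync points reduce how many records a consumer replays.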

6. Future Work

eBay is evaluating remote storage as a replacement for local disks for Kafka data. Remote storage may increase cost, but combining it with tiered storage, a feature under development in Kafka, could keep that cost under control.

Thank you for attending the session.
