How eBay Built a Scalable Kafka‑Based Real‑Time Data Transmission Platform
This article details eBay's year‑long development of an enterprise‑grade, Kafka‑driven data transmission platform, covering its architecture, core services, monitoring and automation strategies, as well as performance tuning techniques that enable high throughput, low latency, and reliable cross‑data‑center replication.
Preface
This article describes the enterprise‑grade data transmission platform eBay built over the past year based on Kafka, including the implementation experience, operational insights, and lessons learned.
1. Overview of eBay Data Transmission Platform
1.1 Why Build a Transmission Platform
Online systems generate massive amounts of data that need to be moved to offline systems for analytics, fraud detection, personalization, and other real‑time use cases. eBay’s online services run on tens of thousands of machines and store petabytes of relational data, requiring a reliable, low‑latency pipeline to transport changes to downstream systems.
Because many offline workloads depend on data from online systems, a real‑time platform is essential to avoid waiting for batch loads.
1.2 Why Use Kafka
Kafka was chosen for its high throughput, strong performance, multi‑subscription support, and message durability. It also offers linear scalability and high availability; the failure of a few nodes does not affect data integrity.
High throughput: Kafka can handle orders of magnitude more messages per second than most other messaging middleware.
High performance: end-to-end latency can reach the millisecond level.
Multi-subscription: Kafka natively supports multiple independent consumers of the same topic.
Message durability: messages are persisted and can be replayed at any time.
1.3 Kafka’s Processing Capacity
eBay operates more than 30 Kafka clusters on its private OpenStack cloud, comprising over 800 virtual machines and serving more than 1,200 applications, handling over 100 billion messages per day. Clusters are partitioned by business domain (e.g., user behavior, price changes) to isolate workloads and manage quotas.
Simply adopting open‑source Kafka is insufficient; eBay adds many enterprise services on top of it to meet security, multi‑data‑center, and operational requirements.
2. Core Services of the eBay Platform
2.1 Purpose of the Metadata Service
The metadata service provides a unified namespace for topics, allowing users to discover the correct Kafka cluster without knowing its physical location. It also introduces the concept of “topic packaging” to enforce quota management and prevent resource abuse.
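The lookup described above can be sketched as a small registry that maps a logical topic name to its physical cluster and quota package. This is an illustrative sketch only; the class, method, and endpoint names are hypothetical, not eBay's actual API.

```python
# Hypothetical sketch of the metadata service: clients ask a central registry
# which physical cluster hosts a logical topic, so they never hard-code
# broker addresses. "Packages" group topics for quota accounting.
class MetadataService:
    def __init__(self):
        # logical topic -> physical cluster endpoint and quota package
        self._registry = {}

    def register(self, topic, cluster, package):
        self._registry[topic] = {"cluster": cluster, "package": package}

    def resolve(self, topic):
        """Return the bootstrap endpoint for a logical topic name."""
        entry = self._registry.get(topic)
        if entry is None:
            raise KeyError(f"unknown topic: {topic}")
        return entry["cluster"]

    def package_of(self, topic):
        """Return the quota package a topic is billed against."""
        return self._registry[topic]["package"]


svc = MetadataService()
svc.register("user.behavior.clicks", "kafka-behavior.dc1:9092", "behavior-pkg")
print(svc.resolve("user.behavior.clicks"))  # kafka-behavior.dc1:9092
```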
2.2 Metadata Service Operation
The service stores metadata separately from Kafka’s internal topics because the built‑in Kafka management topics are insufficient for enterprise needs.
2.3 Kafka Proxy
A logical layer aggregates the dozens of clusters so that clients can connect to a single endpoint while the proxy routes requests to the appropriate underlying cluster, fully implementing the Kafka protocol.
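The routing decision inside such a proxy can be sketched as prefix-based dispatch: clients see one endpoint, and the proxy forwards each request to the backend cluster responsible for that topic. The prefixes and endpoints below are assumptions for illustration.

```python
# Illustrative routing table for a Kafka proxy: requests are dispatched by
# topic prefix to a backend cluster, while clients connect to one endpoint.
ROUTES = {
    "behavior.": "kafka-behavior.dc1:9092",
    "pricing.": "kafka-pricing.dc1:9092",
}

def route(topic, default="kafka-misc.dc1:9092"):
    """Pick the backend cluster for a topic; fall back to a default cluster."""
    for prefix, backend in ROUTES.items():
        if topic.startswith(prefix):
            return backend
    return default
```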
2.4 Tier‑Aggregation and Data Mirroring Service
Data is replicated across multiple data centers (e.g., Shanghai and Beijing) using a Tier‑Aggregation model. Mirroring services handle cross‑data‑center replication, ensuring continuity when a site fails.
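One way to picture the Tier-Aggregation model, assuming the common multi-data-center layout in which producers write to a local cluster in each site and mirror jobs fan data into per-site aggregate clusters:

```python
# Sketch of a tier-aggregation topology: every local cluster is mirrored into
# every aggregate cluster, so each data center ends up with a full copy and
# consumers can survive the loss of a single site.
DATACENTERS = ["shanghai", "beijing"]

def mirror_flows(datacenters):
    """Enumerate the (source, destination) mirror jobs the topology needs."""
    flows = []
    for src in datacenters:
        for dst in datacenters:
            flows.append((f"local-{src}", f"aggregate-{dst}"))
    return flows
```

With two sites this yields four mirror jobs; the cost grows with the number of sites, which is one reason clusters are partitioned by business domain.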
2.5 Schema Registry Service
Kafka itself does not enforce message formats, so a schema registry service imposes a unified data format on all messages flowing through the platform, following the open-source schema-registry model while adding enterprise-level governance.
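In miniature, such a registry stores versioned schemas per subject and validates records against them before they are produced. The sketch below is a toy model with hypothetical names, far simpler than a real registry (no compatibility rules, no serialization).

```python
# Toy schema registry: schemas are keyed by (subject, version), and a record
# validates only if it carries every required field.
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}  # (subject, version) -> set of required fields

    def register(self, subject, version, fields):
        self._schemas[(subject, version)] = set(fields)

    def validate(self, subject, version, record):
        """True if the record (a dict) contains all required fields."""
        required = self._schemas[(subject, version)]
        return required.issubset(record)
```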
2.6 User Self‑Service Portal
A web portal allows users to create topics, register producers/consumers, and manage quotas without contacting administrators directly.
The platform also includes a lightweight PaaS built on OpenStack to automate cluster provisioning, node replacement, and configuration management.
3. System Monitoring and Automation
3.1 Cluster Node Monitoring
Monitoring provides a unified view of node health across all clusters, distinguishing between healthy, failed, and degraded nodes. Manual remediation is currently used, with plans for automated replacement based on defined rules.
3.2 Kafka Status Monitoring
Both node‑level and Kafka‑level metrics are collected, exposing consumer lag, throughput, and latency to operators and end users.
3.3 Slow Node Detection and Handling
Slow nodes (e.g., delivering only 1/10 of expected throughput) are identified via CPU, disk I/O, and custom footprint topics. Remedies include stopping, restarting, or manually rebalancing partitions.
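The detection rule itself can be expressed as a threshold check over per-broker throughput, as in this hedged sketch (the 1/10 factor matches the example above; broker names are illustrative):

```python
def slow_brokers(throughput_by_broker, expected, factor=0.1):
    """Flag brokers delivering less than `factor` of the expected rate,
    e.g. 1/10 of normal throughput as measured via footprint topics."""
    return [b for b, rate in throughput_by_broker.items()
            if rate < expected * factor]
```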
3.4 Offset Index and Automatic Failover
When a data center fails, the Kafka proxy redirects clients to another site, handling offset differences and ensuring continuous consumption without manual intervention.
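Because offsets are not comparable across independently written clusters, failover needs a translation step. One standard approach, sketched here under the assumption that a timestamp index is kept for the mirror cluster, is to resume at the first remote message at or after the last consumed timestamp; this may re-deliver a few messages but never skips any.

```python
import bisect

def resume_offset(remote_index, last_ts):
    """remote_index: sorted list of (timestamp, offset) pairs for the mirror
    cluster. Return the offset at which a failed-over consumer should resume."""
    timestamps = [ts for ts, _ in remote_index]
    i = bisect.bisect_left(timestamps, last_ts)
    if i == len(remote_index):
        return remote_index[-1][1] + 1  # already caught up: resume at log end
    return remote_index[i][1]
```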
4. Kafka Performance Optimizations
4.1 Sequential Disk I/O
Kafka relies on sequential disk reads/writes; avoiding random seeks preserves high throughput.
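The pattern behind this is the append-only log: records are only ever written at the end of a segment file, so the write path moves sequentially instead of seeking. A tiny illustration (segment naming is only suggestive of Kafka's layout):

```python
import os
import tempfile

def append_records(path, records):
    """Append-only writes: open in append mode, never seek or overwrite."""
    with open(path, "ab") as f:
        for r in records:
            f.write(r + b"\n")
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as d:
    seg = os.path.join(d, "00000000.log")
    size = append_records(seg, [b"msg-%d" % i for i in range(3)])
```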
4.2 Page‑Cache‑Centric Design
Kafka reads data from the OS page cache rather than directly from disk, eliminating swap and improving latency. Proper NUMA and CPU pinning in OpenStack are required for optimal performance.
4.3 Zero‑Copy
Zero‑Copy transfers data from disk to the network socket without copying to user space, but can be disabled by SSL/TLS.
4.4 Other Parameter Tweaks
Key settings include large file descriptor limits, increased max socket buffer size, appropriate unclean.leader.election.enable, more replica fetcher threads, and tuned leader rebalance parameters.
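A configuration fragment along these lines might look as follows; the values are illustrative assumptions to be tuned per workload, not eBay's production settings.

```properties
# OS level (e.g. /etc/security/limits.conf): raise the open-file limit,
# since each partition segment and client connection consumes a descriptor.
#   kafka  soft  nofile  100000

# server.properties:
unclean.leader.election.enable=false        # prefer consistency over availability
num.replica.fetchers=4                      # more threads to keep replicas caught up
replica.socket.receive.buffer.bytes=1048576 # larger socket buffers for replication
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=300
```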
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.