How eBay Scales Its Event Platform with ClickHouse and Kubernetes
This article details eBay's event platform architecture, explaining why a dedicated event system is needed, how ClickHouse provides high‑performance storage, the use of Kubernetes CRDs for cross‑region high availability, data routing, read/write separation, and query optimizations with LogQL.
Background
Before introducing the event platform, the article reviews the monitoring platform's four signal types—metrics, logs, traces, and events. Multi-dimensional analysis, alerting, and anomaly detection are built on these signals, underpinning solutions such as BCD, Groot, and Exemplar for root-cause analysis and rapid issue localization.
What Is an Event?
Events are non‑periodic and can be user‑generated (deployment, scaling, configuration) or system‑generated (alerts, access logs), often carrying arbitrary key‑value pairs with high cardinality, making metric‑based solutions unsuitable.
Event Platform
The platform ingests 200 billion events per day, handles over 5 million queries daily, and runs on more than 400 ClickHouse nodes with over 1 PB of storage.
ClickHouse was chosen for its column‑store architecture, high compression (10‑100×) using LowCardinality, Delta encoding, LZ4/ZSTD, and vectorized columnar computation that maximizes CPU cache hits and SIMD utilization.
It also supports runtime code generation, vertical and horizontal query parallelism, and plans for per‑shard multi‑replica parallel processing.
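As an illustration of how those compression techniques combine, a hypothetical events table might pair LowCardinality dictionary encoding with Delta and ZSTD/LZ4 codecs. The table and column names below are illustrative, not eBay's actual schema:

```sql
-- Hypothetical events table; names are illustrative, not eBay's schema.
CREATE TABLE events
(
    ts         DateTime64(3) CODEC(Delta, ZSTD),  -- delta-encode timestamps, then ZSTD
    type       LowCardinality(String),            -- dictionary-encode repetitive values
    namespace  LowCardinality(String),
    attrs      Map(String, String) CODEC(ZSTD),   -- arbitrary high-cardinality key-value pairs
    body       String CODEC(LZ4)                  -- fast general-purpose compression
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (namespace, type, ts);
```

Delta encoding works well on monotonically increasing timestamps, while LowCardinality shines on columns with few distinct values, which together account for much of the 10-100× compression the article cites.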
The platform is fully containerized and orchestrated with Kubernetes, using custom resources (CRDs) such as FCHI and CHI to manage cross‑region ClickHouse clusters. Otel‑compatible data models are used for event collection, and both SQL and LogQL are provided for querying, integrating with Grafana and Prometheus.
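On the query side, LogQL (the query language Grafana uses for Loki) filters events by label matchers and line content. A query against this platform might look like the following, with all labels and values illustrative:

```logql
-- Find deployment events in a namespace mentioning "rollback" (labels illustrative)
{namespace="checkout", type="deployment"} |= "rollback"

-- Count events over 5-minute windows for alerting
count_over_time({namespace="checkout"}[5m])
```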
Data routing is handled via WISB (expected routing) and WIRI (actual routing) records, enabling namespace‑based virtual ClickHouse resources and lightweight migration using virtual clusters and distributed tables.
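Lightweight migration via virtual clusters typically relies on ClickHouse's Distributed table engine, which lets the routing layer repoint a logical table at a different physical sub-cluster without moving data eagerly. The cluster, database, and table names below are illustrative:

```sql
-- A Distributed table fronting a virtual cluster; names are illustrative.
CREATE TABLE events_all AS events
ENGINE = Distributed(
    'virtual_cluster_a',  -- logical cluster defined in remote_servers config
    'default',            -- database on the remote shards
    'events',             -- local table on each shard
    rand()                -- sharding key: spread writes evenly
);
```

Because queries and writes go through the Distributed table, remapping 'virtual_cluster_a' to a new set of shards migrates a namespace without changing client-facing names.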
Read/write separation is achieved through the readWriteMod parameter in FCHI, which creates separate virtual sub-clusters for reads and writes.
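The FCHI schema is internal to eBay, so its exact fields are not public; the sketch below shows how such a custom resource might express read/write separation. Apart from readWriteMod, which the article names, every field here is an assumption:

```yaml
# Hypothetical FCHI manifest; only readWriteMod is named in the article.
apiVersion: monitoring.ebay.io/v1   # assumed API group/version
kind: FCHI
metadata:
  name: events-cluster
spec:
  readWriteMod: separated           # split into read and write virtual sub-clusters
```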
Typical Case
eBay migrated a service-mesh monitoring workload from Elasticsearch to ClickHouse, reducing storage to 30% of the original while extending retention from 9 to 30 days and improving anomaly-detection query performance tenfold.
Future Outlook
The platform currently supports only fully structured data; future work includes adding support for semi‑structured and unstructured data using Map‑based free schema and ClickHouse’s JSON column type, as well as optimizing cross‑region aggregation to reduce network traffic.
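A sketch of what semi-structured storage could look like with ClickHouse's JSON column type, which is experimental on recent versions and must be enabled explicitly; the table and settings here are illustrative, not the platform's actual design:

```sql
-- Sketch of semi-structured event storage; names are illustrative.
SET allow_experimental_json_type = 1;  -- required on ClickHouse versions where JSON is experimental
CREATE TABLE raw_events
(
    ts   DateTime,
    data JSON  -- dynamically typed subcolumns, queryable as e.g. data.user.id
)
ENGINE = MergeTree
ORDER BY ts;
```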
Efficient Ops
This public account is maintained by Xiaotianguo and friends, focusing on operations transformation and regularly publishing original technical articles.