Scaling eBay’s Sherlock.io ClickHouse Platform with Read/Write Separation and Keeper
The article details how eBay’s Sherlock.io event monitoring platform, built on ClickHouse, faced scaling and performance challenges due to ZooKeeper bottlenecks, and explains the design and implementation of read/write separation, shard‑level Keeper coordination, and related operational fixes to improve reliability and latency.
Problem Statement
eBay’s Sherlock.io event monitoring platform stores, processes, and visualizes diverse event data using a ClickHouse cluster. The platform originally relied on a shared ClickHouse cluster, with dedicated clusters for high‑volume users. Over 20 ClickHouse clusters (shared and dedicated) are in use, each with multiple shards and three replicas spread across three eBay data centers.
When a user's data volume grows, a new dedicated ClickHouse cluster can be spun up. However, the OLAP use case exposed severe load issues: bad queries from self-service alert rules saturated the ClickHouse thread pool, causing high load and occasional data loss during UTC-0 rotation. The architecture uses ClickHouse ReplicatedMergeTree tables, whose replicas are synchronized through a centralized ZooKeeper ensemble. With each shard creating about 100 new parts per second and more than 30 billion inserts arriving daily, ZooKeeper accumulated more than 1,000 outstanding requests, which led to data-loss incidents.
These symptoms indicated that ZooKeeper had become a horizontal scaling bottleneck as shard count increased.
Solution Overview
Two complementary strategies were adopted:
Read/Write Separation: Introduce a hot/cold tier in which hot data resides on local SSDs, and designate specific replicas within each shard as write-only and the others as read-only. A readWriteMod field in the FCHC CRD controls the mode: replicas where (replica_num % readWriteMod) != 0 become read nodes, while those where the remainder is zero become write nodes. The operator creates a virtual read-only cluster and rewrites distributed tables to point to it.
Shard-Level Consistency Service: Replace the single ZooKeeper ensemble with per-shard coordination services. With ClickHouse Keeper (a ZooKeeper-compatible coordination service that ships in the ClickHouse code base and uses the Raft consensus algorithm), each shard runs its own Keeper cluster, which removes the pressure on a global ZooKeeper ensemble.
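ClickHouse servers locate their coordination service through a <zookeeper> configuration block, so shard-level coordination amounts to giving each shard its own list of endpoints. The sketch below illustrates that idea only; the shard names, host names, and the zookeeper_block helper are hypothetical, and the code is not the operator's actual implementation. Port 9181 is the Keeper client port used elsewhere in this article. ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def zookeeper_block(keeper_hosts, port=9181):
    """Render a <zookeeper> block listing one shard's Keeper endpoints."""
    root = ET.Element("zookeeper")
    for host in keeper_hosts:
        node = ET.SubElement(root, "node")
        ET.SubElement(node, "host").text = host
        ET.SubElement(node, "port").text = str(port)
    ET.indent(root, space="    ")  # pretty-print; Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Hypothetical layout: every shard has its own three-node Keeper cluster.
shard_keepers = {
    "shard-0": ["keeper-0-a", "keeper-0-b", "keeper-0-c"],
    "shard-1": ["keeper-1-a", "keeper-1-b", "keeper-1-c"],
}

# One config fragment per shard; no shard references another shard's Keeper.
configs = {shard: zookeeper_block(hosts) for shard, hosts in shard_keepers.items()}

print(configs["shard-0"])
```

Because each fragment lists only its own shard's endpoints, coordination traffic from one shard never reaches another shard's Keeper nodes, which is the property the per-shard design relies on.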
Implementation Details
Read/Write separation is enabled by adding readWriteMod to the FCHC CRD. When the value is greater than 1, the operator classifies replicas accordingly and creates a virtual read cluster. The write replica handles ingress writes; reads are served exclusively by read replicas.
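The modulo rule can be sketched in a few lines. This is a minimal illustration rather than the operator's code: the Replica type and classify function are hypothetical names, replica numbering is assumed to start at zero, and the behavior for readWriteMod values of 1 or less (every replica serves both roles) is an assumption based on the statement that separation only applies when the value is greater than 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    shard: int
    replica_num: int  # assumed to be zero-based


def classify(replicas, read_write_mod):
    """Split replicas into (write_nodes, read_nodes) using the modulo rule."""
    if read_write_mod <= 1:
        # Assumption: separation is disabled, so every replica does both.
        return list(replicas), list(replicas)
    write_nodes = [r for r in replicas if r.replica_num % read_write_mod == 0]
    read_nodes = [r for r in replicas if r.replica_num % read_write_mod != 0]
    return write_nodes, read_nodes


# One shard with three replicas and readWriteMod = 3.
replicas = [Replica(shard=0, replica_num=n) for n in range(3)]
write_nodes, read_nodes = classify(replicas, read_write_mod=3)

print([r.replica_num for r in write_nodes])  # [0]
print([r.replica_num for r in read_nodes])   # [1, 2]
```

Under these assumptions, a three-replica shard with readWriteMod set to 3 ends up with one write replica and two read replicas, and the read replicas are the ones that would be placed in the virtual read cluster.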
To adopt Keeper, the CRD gained an enableKeeper boolean and a keeperConfig section. When enableKeeper is true, the operator generates Keeper configuration files for each ClickHouse server, and keeperConfig can override specific Keeper settings.
enableKeeper: true
keeperConfig:
  coordinationSettings:
    raft_logs_level: trace
  keeperNodesCount: 3
  tcpPort: 9181
For servers that also run a Keeper instance, the <keeper_server> XML block is added to the ClickHouse config:
<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>190</server_id>
    <log_storage_path>/var/lib/keeper/log</log_storage_path>
    <snapshot_storage_path>/var/lib/keeper/snapshots</snapshot_storage_path>
    <raft_configuration>
        <server>
            <id>190</id>
            <hostname>host-38-0-0</hostname>
            <port>9999</port>
        </server>
        ...
    </raft_configuration>
</keeper_server>
ClickHouse servers that rely on an external coordination service reference it via the standard <zookeeper> block, now pointing to the per-shard Keeper endpoints:
<zookeeper>
    <node>
        <host>host-38-0-0</host>
        <port>9181</port>
    </node>
    <node>
        <host>host-45-0-0</host>
        <port>9181</port>
    </node>
    <node>
        <host>host-94-0-0</host>
        <port>9181</port>
    </node>
</zookeeper>
Testing and Operational Fixes
During Keeper integration testing, several issues surfaced:
Ordinary database creation failure: New ClickHouse versions default to the Atomic database engine and deprecate Ordinary databases. Adding allow_deprecated_database_ordinary: "true" resolves the error.
ClickHouse server startup deadlock: The operator's default readiness probe checks port 8123, but Raft required the server to be marked ready before it could finish starting, so the two waited on each other. Adding a config-file readiness check broke the deadlock.
IP reuse after pod restart: In resource-constrained clusters, pod IPs are recycled, which can cause a shard's pod to join the wrong quorum. The workaround is to use distinct ports per shard, which prevents cross-shard quorum membership.
Shard latency spikes: When all clients connect to a single Keeper node, write batching is fragmented, which raises latency. Distributing clients across multiple local Keeper instances mitigated the issue.
Keeper server startup timeout: With a large log store, Raft initialization exceeds the default 30 s timeout. Increasing the startup timeout and contributing a PR that takes a snapshot on shutdown reduced startup time dramatically.
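The distinct-ports workaround for IP reuse can be pictured as follows. This is a sketch under stated assumptions: the base-plus-shard-index scheme, the shard_ports and can_join helpers, and the max_shards limit are all hypothetical, and the article does not say how ports were actually assigned. The base values 9181 and 9999 are the client and Raft ports from the configuration shown earlier.

```python
def shard_ports(shard_index, client_base=9181, raft_base=9999, max_shards=500):
    """Return hypothetical per-shard Keeper ports derived from the shard index."""
    # The limit keeps the client range (9181..9680) below the Raft range (9999..).
    if not 0 <= shard_index < max_shards:
        raise ValueError(f"shard_index must be in [0, {max_shards})")
    return {
        "tcp_port": client_base + shard_index,
        "raft_port": raft_base + shard_index,
    }


def can_join(peer_port, shard_index):
    """A peer can only reach a shard's quorum on that shard's own Raft port."""
    return peer_port == shard_ports(shard_index)["raft_port"]


# A stale peer entry for shard 0 whose IP is now held by a shard 1 pod.
stale_peer_port = shard_ports(0)["raft_port"]

print(can_join(stale_peer_port, 0))  # True
print(can_join(stale_peer_port, 1))  # False
```

With ports that differ per shard, a recycled IP presumably leads only to a connection attempt on a port that the other shard's Keeper is not listening on, so the pod cannot end up in the wrong quorum.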
Production Adoption
All new ClickHouse clusters in Sherlock.io now run with Keeper as the default coordination service, and read/write separation is enabled for critical workloads. The original OLAP cluster has been migrated to the Keeper‑backed architecture, achieving stable write latency and reliable read/write isolation.
Conclusion and Outlook
Implementing read/write separation and shard‑level Keeper coordination resolves the ZooKeeper bottleneck, improves data integrity, and scales the event platform horizontally. Future work includes extending Keeper support to clusters with only two replicas (currently limited to a quorum of three) to accommodate log‑centric workloads that require only dual replication.
dbaplus Community
