How eBay Scaled ClickHouse with Read/Write Separation and Keeper
This article details eBay's event monitoring platform architecture, explains the challenges of high‑load OLAP workloads on ClickHouse clusters, describes the design and implementation of read/write separation and multi‑shard Keeper coordination, and shares concrete configuration snippets, performance observations, and production lessons learned.
Problem Statement
eBay's Sherlock.io event monitoring platform stores and visualises various event signals using a ClickHouse cluster. By default, all users write to a shared ClickHouse cluster, while high‑volume customers can be assigned dedicated clusters. Over 20 ClickHouse clusters (shared and dedicated) are deployed, each with multiple shards and three replicas spread across three data centers.
When a customer's data volume spikes, the shared cluster may become unsuitable, prompting the team to spin up a dedicated ClickHouse cluster for that user.
Emerging Issues
The OLAP data case, originally stored in Druid, was migrated to the event platform. After exposing self‑service alert rules, a flood of bad queries overloaded the ClickHouse OLAP cluster, causing high load. Because ClickHouse uses a shared thread pool for reads and writes, heavy read traffic starved write operations.
To protect writes, the team proposed adding a dedicated replica for important cases, routing ingress writes to that replica and letting egress queries hit read‑only replicas, achieving read/write isolation.
Solution Overview
1. Read/Write Separation
The platform adopted a cold‑hot tiering architecture: hot data resides on local SSDs, and the ClickHouse ReplicatedMergeTree engine synchronises replicas via a centralized ZooKeeper cluster. The new design designates specific replicas per shard as write‑only and others as read‑only. A readWriteMod field in the FCHC CRD enables the feature when its value exceeds 1.
Replica assignment logic:
If {replica_num} % readWriteMod != 0, the replica becomes a read node and is added to a virtual read cluster.
If {replica_num} % readWriteMod == 0, the replica becomes a write node . Ingress prefers these replicas for writes; if unavailable, it falls back to a read replica to preserve data safety.
2. Horizontal Scaling of ClickHouse
The original bottleneck was ZooKeeper: each of the 30 billion daily inserts generated over 100 new parts per second, and 30 replicas (10 shards × 3 replicas) constantly queried ZooKeeper for metadata, pushing outstanding requests beyond 1 K. Scaling shards increased ZooKeeper load linearly.
Starting with ClickHouse v21.3, ClickHouse Keeper (a RAFT‑based replacement for ZooKeeper) can be embedded in ClickHouse servers. Keeper offers linear reads/writes, full ZooKeeper protocol compatibility, and can be deployed per shard, eliminating the centralised ZooKeeper bottleneck.
Implementation adds an enableKeeper flag to the FCHC CRD. When true, the operator generates Keeper configuration files for each ClickHouse server and can override specific Keeper settings.
enableKeeper: true
keeperConfig:
coordinationSettings:
raft_logs_level: trace
keeperNodesCount: 3
tcpPort: 9181For servers that also run a Keeper instance, the following XML snippet is injected into the ClickHouse config:
<keeper_server>
<tcp_port>9181</tcp_port>
<server_id>190</server_id>
<log_storage_path>/var/lib/keeper/log</log_storage_path>
<snapshot_storage_path>/var/lib/keeper/snapshots</snapshot_storage_path>
<raft_configuration>
<server>
<id>190</id>
<hostname>host-38-0-0</hostname>
<port>9999</port>
</server>
...
</raft_configuration>
</keeper_server>When a shard does not embed Keeper, its <zookeeper> section points to the external RAFT cluster:
<zookeeper>
<node>
<host>host-38-0-0</host>
<port>9181</port>
</node>
...
</zookeeper>3. Testing and Improvements
During testing, several issues surfaced:
Ordinary database creation failure : New ClickHouse versions deprecate the Ordinary engine in favour of Atomic. Enabling allow_deprecated_database_ordinary: "true" resolves the error.
ClickHouse server start‑up failure : The operator adds a readiness probe on port 8123, but the server waits for RAFT quorum before becoming ready, causing a dead‑lock. Adding a config‑file readiness check breaks the loop.
IP reuse during sequential pod restarts : In tight‑resource clusters, a pod from shard 0 may acquire the IP of a pod from shard 1 after deletion, causing the former to join the wrong quorum. The team mitigated this by assigning distinct ports per shard, ensuring cross‑shard quorum contamination cannot occur.
Higher latency on some shards : After enabling Keeper, most shards performed well, but a few exhibited increased latency because all clients connected to a single Keeper node. The fix was to distribute clients across multiple Keeper instances, e.g., by configuring only local Keeper servers in the <zookeeper> section.
Example of Keeper's request‑batching logic (excerpt from source code):
/// Batch all write (quorum) requests into a vector until the previous batch finishes or max_batch size is reached.
/// Reads act as a separator for writes.Production Rollout
All clusters created after the change now run with Keeper as the default coordination service. The original OLAP cluster was migrated to a Keeper‑backed ClickHouse deployment with read/write separation enabled. Post‑migration metrics show stable write latency and no data loss during UTC‑0 rotation.
Conclusion and Outlook
Implementing read/write separation on ClickHouse protects critical data from read‑induced load spikes, while replacing the central ZooKeeper with per‑shard Keeper removes the horizontal scaling bottleneck. Future work includes enabling Keeper for clusters with only two replicas (currently limited to a quorum of three) and further automating client‑side load‑balancing across Keeper nodes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
