Scaling Kafka to Support Millions of Partitions Without Downtime
This article explains the metadata, controller, and Zookeeper challenges of supporting a million‑plus Kafka partitions and presents practical solutions such as parallel ZK fetching, metadata‑via‑topic redesign, logical cluster assembly, and physical cluster splitting to achieve large‑scale, stable Kafka deployments.
Background
When business traffic grows, Kafka clusters often need to increase the number of topics and partitions, sometimes reaching millions of partitions. Each partition creates a Zookeeper (ZK) node, consumes broker resources, and adds load to the controller. As the partition count rises, metadata size, controller recovery time, and broker I/O performance degrade, making a single‑cluster scale‑out difficult.
Challenges
Zookeeper node count : 10,000 topics with 100 partitions each generate >2 million ZK nodes, producing ~300 MB snapshot files. Larger snapshots increase ZK restart time and slow client metadata reads.
Partition replication : Replication threads are shared among many partitions. With more partitions, disk writes become fragmented, I/O throughput drops, and ISR (in‑sync replica) fluctuations become frequent.
Controller failover latency : During a controller switch the new controller must read all partition metadata from ZK and push it to every broker. Tests show ~37 seconds to load 1 million partitions from ZK and an additional ~40 seconds for brokers to deserialize and apply the metadata.
Broker restart recovery : When a broker restarts, its leader partitions must be reassigned and the corresponding ZK nodes updated. The larger the partition set, the longer this process takes, extending overall cluster unavailability.
Solution 1 – Optimize a single Zookeeper cluster
Parallel ZK node fetching : Replace the default single‑threaded metadata pull with a multi‑threaded implementation. In a VM test, fetching metadata for 1 million partitions with one thread took ~28 s; distributing the work across five threads (each handling ~200 k partitions) reduced the time to ~14 s. This shortens controller recovery and reduces the window of unavailability.
Metadata synchronization via internal topic : Instead of writing the full metadata directly to ZK, publish it to a dedicated internal Kafka topic. Controllers write updates to this topic; brokers consume the records and refresh their in‑memory metadata. This eliminates the burst of ZK writes during failover and lowers network traffic between controller and brokers. The approach requires adding a metadata‑topic consumer on each broker but avoids heavy ZK I/O.
Solution 2 – Build a logical cluster from multiple physical clusters
To keep the existing architecture largely unchanged, a metadata aggregation layer can present a unified view of several independent Kafka clusters.
Metadata service : Designate one cluster as the primary (metadata, consumer‑offset, and transaction coordinator). The primary aggregates metadata from all secondary clusters and returns a combined response to clients. When a client requests metadata, the primary may forward the request to secondary clusters, merge the results, and send the unified metadata back.
Consumer group and transaction coordination : Store offset and transaction topics only in the primary cluster. This guarantees that all consumers see a consistent coordinator, avoiding split‑brain scenarios across clusters.
Implementation notes : The primary’s metadata API must be extended to perform remote fetches, merge topic‑partition lists, and handle versioning. The extra latency of remote fetches occurs only during metadata refresh, not on the data path, so client impact is minimal.
Solution 3 – Split an overloaded physical cluster
If a single cluster becomes too large, it can be divided into two independent clusters.
Separate the broker set into two groups. Each group connects to a distinct Zookeeper ensemble (e.g., add observer nodes for the new group).
Use Kafka’s built‑in migration tool ( kafka-reassign-partitions.sh) to move whole topics to the appropriate broker group, ensuring that all partitions of a given topic stay within the same group.
After migration, remove the observer nodes from the original ZK ensemble and form a new ZK quorum for the second broker group.
This isolation reduces controller and ZK load, improves fault‑tolerance, and keeps recovery times short.
Key Metrics from Experiments
10 k topics × 100 partitions ⇒ ~2 M ZK nodes, ~300 MB snapshot.
1 M partitions generate ~80 MB of serialized metadata (size varies with replica count and topic name length).
Controller recovery from ZK for 1 M partitions: ~37 s; broker metadata distribution: ~40 s.
Parallel metadata fetch with 5 threads: total time reduced from ~28 s to ~14 s (VM‑dependent).
Conclusion
Improving controller performance (parallel ZK fetch, metadata‑via‑topic) and assembling multiple physical clusters into a logical view are effective ways to support millions of partitions. Logical clustering requires fewer architectural changes and offers better fault‑recovery guarantees, while splitting an oversized cluster provides a straightforward path to isolate load.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
