Operations 15 min read

Scaling Kafka to Support Millions of Partitions Without Downtime

This article explains the metadata, controller, and Zookeeper challenges of supporting a million‑plus Kafka partitions and presents practical solutions such as parallel ZK fetching, metadata‑via‑topic redesign, logical cluster assembly, and physical cluster splitting to achieve large‑scale, stable Kafka deployments.

Tencent Cloud Middleware

Apr 9, 2020

Scaling Kafka to Support Millions of Partitions Without Downtime

Background

When business traffic grows, Kafka clusters often need to increase the number of topics and partitions, sometimes reaching millions of partitions. Each partition creates a Zookeeper (ZK) node, consumes broker resources, and adds load to the controller. As the partition count rises, metadata size, controller recovery time, and broker I/O performance degrade, making a single‑cluster scale‑out difficult.

Challenges

Zookeeper node count : 10,000 topics with 100 partitions each generate >2 million ZK nodes, producing ~300 MB snapshot files. Larger snapshots increase ZK restart time and slow client metadata reads.

Partition replication : Replication threads are shared among many partitions. With more partitions, disk writes become fragmented, I/O throughput drops, and ISR (in‑sync replica) fluctuations become frequent.

Controller failover latency : During a controller switch the new controller must read all partition metadata from ZK and push it to every broker. Tests show ~37 seconds to load 1 million partitions from ZK and an additional ~40 seconds for brokers to deserialize and apply the metadata.

Broker restart recovery : When a broker restarts, its leader partitions must be reassigned and the corresponding ZK nodes updated. The larger the partition set, the longer this process takes, extending overall cluster unavailability.

Solution 1 – Optimize a single Zookeeper cluster

Parallel ZK node fetching : Replace the default single‑threaded metadata pull with a multi‑threaded implementation. In a VM test, fetching metadata for 1 million partitions with one thread took ~28 s; distributing the work across five threads (each handling ~200 k partitions) reduced the time to ~14 s. This shortens controller recovery and reduces the window of unavailability.

Metadata synchronization via internal topic : Instead of writing the full metadata directly to ZK, publish it to a dedicated internal Kafka topic. Controllers write updates to this topic; brokers consume the records and refresh their in‑memory metadata. This eliminates the burst of ZK writes during failover and lowers network traffic between controller and brokers. The approach requires adding a metadata‑topic consumer on each broker but avoids heavy ZK I/O.

Solution 2 – Build a logical cluster from multiple physical clusters

To keep the existing architecture largely unchanged, a metadata aggregation layer can present a unified view of several independent Kafka clusters.

Metadata service : Designate one cluster as the primary (metadata, consumer‑offset, and transaction coordinator). The primary aggregates metadata from all secondary clusters and returns a combined response to clients. When a client requests metadata, the primary may forward the request to secondary clusters, merge the results, and send the unified metadata back.

Consumer group and transaction coordination : Store offset and transaction topics only in the primary cluster. This guarantees that all consumers see a consistent coordinator, avoiding split‑brain scenarios across clusters.

Implementation notes : The primary’s metadata API must be extended to perform remote fetches, merge topic‑partition lists, and handle versioning. The extra latency of remote fetches occurs only during metadata refresh, not on the data path, so client impact is minimal.

Solution 3 – Split an overloaded physical cluster

If a single cluster becomes too large, it can be divided into two independent clusters.

Separate the broker set into two groups. Each group connects to a distinct Zookeeper ensemble (e.g., add observer nodes for the new group).

Use Kafka’s built‑in migration tool ( kafka-reassign-partitions.sh) to move whole topics to the appropriate broker group, ensuring that all partitions of a given topic stay within the same group.

After migration, remove the observer nodes from the original ZK ensemble and form a new ZK quorum for the second broker group.

This isolation reduces controller and ZK load, improves fault‑tolerance, and keeps recovery times short.

Key Metrics from Experiments

10 k topics × 100 partitions ⇒ ~2 M ZK nodes, ~300 MB snapshot.

1 M partitions generate ~80 MB of serialized metadata (size varies with replica count and topic name length).

Controller recovery from ZK for 1 M partitions: ~37 s; broker metadata distribution: ~40 s.

Parallel metadata fetch with 5 threads: total time reduced from ~28 s to ~14 s (VM‑dependent).

Conclusion

Improving controller performance (parallel ZK fetch, metadata‑via‑topic) and assembling multiple physical clusters into a logical view are effective ways to support millions of partitions. Logical clustering requires fewer architectural changes and offers better fault‑recovery guarantees, while splitting an oversized cluster provides a straightforward path to isolate load.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Zookeeper Kafka Cluster Operations partition scaling controller optimization logical cluster

Written by

Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

Challenges

Solution 1 – Optimize a single Zookeeper cluster

Solution 2 – Build a logical cluster from multiple physical clusters

Solution 3 – Split an overloaded physical cluster

Key Metrics from Experiments

Conclusion

Tencent Cloud Middleware

How this landed with the community

Was this worth your time?

0 Comments

Solution 1 – Optimize a single Zookeeper cluster

Solution 2 – Build a logical cluster from multiple physical clusters

Solution 3 – Split an overloaded physical cluster