Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions
This article examines the technical challenges of scaling Kafka clusters to handle millions of partitions—including Zookeeper node explosion, replication overhead, controller recovery latency, and broker restart delays—and proposes solutions such as parallel ZK fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.
For small workloads, multiple services often share a single Kafka cluster. As the business scales, topics and partitions must be added continuously, yet very large clusters hit limits that prevent further direct expansion, forcing the creation of new clusters and imposing costly changes on the business side.
To support rapid business expansion without requiring changes from the business side, clusters must be expanded, which typically means increasing the number of partitions and broker nodes, introducing several operational challenges.
Challenges
1. ZK node count : Each partition creates multiple Zookeeper nodes (a partition znode plus its state child); with 10,000 topics each having 100 partitions, the cluster holds 1 million partitions and ZK nodes exceed 2 million, inflating snapshot size (~300 MB) and slowing snapshot loading and transaction log writes, which in turn affects client response times.
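The znode arithmetic above can be checked with a back-of-envelope estimate. The two-znodes-per-partition count and the ~150-byte average znode size below are rough assumptions chosen to reproduce the article's figures, not measured values:

```python
# Back-of-envelope estimate of Zookeeper znode count and snapshot size for a
# large Kafka cluster. ZNODES_PER_PARTITION (partition znode plus its "state"
# child) and AVG_ZNODE_BYTES are assumed, illustrative values.

TOPICS = 10_000
PARTITIONS_PER_TOPIC = 100
ZNODES_PER_PARTITION = 2   # /partitions/<p> and /partitions/<p>/state
AVG_ZNODE_BYTES = 150      # assumed average serialized znode size

partitions = TOPICS * PARTITIONS_PER_TOPIC
znodes = partitions * ZNODES_PER_PARTITION
snapshot_mb = znodes * AVG_ZNODE_BYTES / 1024 / 1024

print(f"partitions: {partitions:,}")         # 1,000,000
print(f"znodes:     {znodes:,}")             # 2,000,000
print(f"snapshot:   ~{snapshot_mb:.0f} MB")  # ~286 MB, in line with ~300 MB
```

The point of the estimate is that snapshot size grows linearly with partition count, so doubling partitions roughly doubles snapshot load time as well.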
2. Partition replication : Replication threads handle many partitions; as partition count grows, disk I/O becomes fragmented, replication may lag, and ISR fluctuations increase. Adding replication threads reduces per‑thread load but intensifies disk contention.
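The replication-thread trade-off can be illustrated with a small sketch. Kafka spreads partitions over its replica fetcher threads by hashing; the `fetcher_for` function below is a simplified stand-in for that placement, not Kafka's exact formula:

```python
# Sketch of how partitions spread across replica fetcher threads, to show the
# trade-off: more threads means fewer partitions per thread, but each thread
# issues its own disk and network I/O, so contention grows with thread count.

from collections import Counter

def fetcher_for(topic: str, partition: int, num_fetchers: int) -> int:
    # Simplified stand-in for Kafka's hash-mod fetcher assignment.
    return (hash(topic) + partition) % num_fetchers

partitions = [("orders", p) for p in range(1000)]

for num_fetchers in (1, 4, 16):
    load = Counter(fetcher_for(t, p, num_fetchers) for t, p in partitions)
    print(num_fetchers, "threads -> max partitions per thread:",
          max(load.values()))
```

With 1,000 partitions, one thread carries all 1,000, while 16 threads carry at most 63 each; the per-thread load shrinks, but sixteen concurrent fetchers now compete for the same disks.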
3. Controller switch latency : Failures trigger controller switches that must reload metadata from ZK and propagate it to all brokers; restoring 1 million partitions can take ~37 s for ZK metadata retrieval and ~40 s for broker deserialization, risking prolonged unavailability.
4. Broker restart/recovery time : Broker restarts require leader handoff and ZK metadata updates; more partitions lengthen leader switch times, further extending downtime.
These factors show that increasing partition count directly impacts controller recovery, disk performance, and broker restart latency.
Solutions
1. Single ZK cluster optimizations
Parallel ZK node fetching: Use multiple threads to retrieve partition metadata from ZK concurrently; in tests this reduced the controller's ZK metadata retrieval from ~37 s to ~14 s.
Change metadata synchronization method: Store metadata in an internal Kafka topic instead of ZK, allowing brokers to consume updates via messages, though this requires significant architectural changes.
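The parallel-fetch idea can be sketched as follows. `zk_read` here just simulates one ZK round trip with a sleep; a real controller would issue the reads through its ZK client, but the speedup mechanism, overlapping many independent reads, is the same:

```python
# Minimal sketch of parallel ZK metadata fetching: instead of reading
# partition state one znode at a time, fan the reads out over a thread pool.
# zk_read() is a simulated stand-in for a single Zookeeper round trip.

import time
from concurrent.futures import ThreadPoolExecutor

def zk_read(path: str) -> str:
    time.sleep(0.001)  # stand-in for one ZK round-trip latency
    return f"state-for-{path}"

paths = [f"/brokers/topics/t/partitions/{p}/state" for p in range(200)]

start = time.time()
serial = [zk_read(p) for p in paths]
serial_s = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    parallel = list(pool.map(zk_read, paths))
parallel_s = time.time() - start

assert serial == parallel  # same results, much less wall-clock time
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Because each read is independent, wall-clock time drops roughly in proportion to the worker count until the ZK ensemble itself becomes the bottleneck, which matches the ~37 s to ~14 s improvement reported above.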
2. Logical cluster composition
Assemble multiple physical clusters into a logical cluster by designating a primary cluster to serve metadata, consumer offsets, and transaction coordination, while other clusters handle only business messages. The primary cluster aggregates metadata from all clusters and presents a unified view to clients, minimizing impact on production paths.
Key steps include:
Metadata service: Primary cluster fetches metadata from secondary clusters, merges it, and returns to clients.
Consumer group and transaction coordination: Centralize offset and transaction services in the primary cluster to avoid cross‑cluster inconsistencies.
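The metadata-service step above can be sketched as a simple merge. `fetch_metadata` is a hypothetical stand-in for a Metadata request against each physical cluster's brokers; the cluster names, topics, and broker IDs are illustrative:

```python
# Hedged sketch of the primary cluster's metadata aggregation: pull topic
# metadata from each physical cluster and return one merged view, so clients
# see a single logical cluster.

def fetch_metadata(cluster: str) -> dict:
    # Hypothetical per-cluster metadata: topic -> list of (partition, leader).
    sample = {
        "cluster-a": {"orders": [(0, "a-broker-1"), (1, "a-broker-2")]},
        "cluster-b": {"payments": [(0, "b-broker-1")]},
    }
    return sample[cluster]

def merged_view(clusters):
    view = {}
    for c in clusters:
        for topic, parts in fetch_metadata(c).items():
            # Topics are assumed unique per physical cluster; a real
            # implementation must reject or namespace duplicates.
            view[topic] = parts
    return view

print(merged_view(["cluster-a", "cluster-b"]))
```

Because only the metadata path is aggregated, produce and consume traffic still flows directly to the physical cluster that owns each topic, which is why the production path is largely unaffected.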
3. Physical cluster splitting
When a single physical cluster becomes too large, split it: divide the brokers into two groups, attach one group to observer nodes added to the existing Zookeeper ensemble, use Kafka's built-in partition reassignment tool (kafka-reassign-partitions.sh) to move topics onto the appropriate broker group, and finally promote the observers into an independent ZK ensemble for the split-off cluster.
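The topic moves in a split are driven by a reassignment plan in the JSON format that kafka-reassign-partitions.sh consumes. The sketch below generates such a plan; the topic name, partition count, broker IDs, and round-robin replica placement are illustrative choices, not a prescribed layout:

```python
# Sketch of generating a reassignment plan in the JSON format consumed by
# Kafka's kafka-reassign-partitions.sh, moving every partition of a topic
# onto the broker group that will belong to the split-off cluster.

import json

def reassignment_plan(topic, num_partitions, target_brokers, replication=2):
    plan = {"version": 1, "partitions": []}
    for p in range(num_partitions):
        # Spread replicas round-robin over the target broker group.
        replicas = [target_brokers[(p + i) % len(target_brokers)]
                    for i in range(replication)]
        plan["partitions"].append(
            {"topic": topic, "partition": p, "replicas": replicas})
    return json.dumps(plan, indent=2)

print(reassignment_plan("orders", 3, target_brokers=[101, 102, 103]))
```

The resulting file would be passed to the tool with `--execute`; once all topics live entirely on one broker group or the other, the ZK ensembles can be separated without moving any further data.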
Conclusion
Improving controller performance and assembling multiple physical clusters into a logical cluster both increase the partition capacity of a single logical cluster. Compared to extensive architectural changes, logical clustering offers smaller impact on existing Kafka setups and provides more reliable fault‑recovery times.
Ultimately, maintaining a moderate partition count per cluster and using business‑level partitioning across multiple clusters remains a practical strategy.
Author: Ding Jun, Tencent Cloud CKafka lead, with extensive experience in messaging, caching, and NoSQL infrastructure.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn