Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions
This article examines the technical challenges of scaling Kafka clusters to handle millions of partitions—including Zookeeper node explosion, replication overhead, controller recovery latency, and broker restart delays—and proposes solutions such as parallel ZK fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.
For small workloads, multiple services often share a single Kafka cluster. As the business scales, topics and partitions must be added continuously, yet very large clusters hit limits that prevent further direct expansion, forcing the creation of new clusters and imposing costly changes on the business side.
To support rapid business expansion without requiring changes from the business side, clusters must be expanded, which typically means increasing the number of partitions and broker nodes, introducing several operational challenges.
Challenges
1. ZK node count : Each partition creates multiple Zookeeper nodes (a partition znode plus its state child); with 10,000 topics each having 100 partitions, the cluster holds 1 million partitions and ZK nodes exceed 2 million, inflating snapshot size (~300 MB) and slowing snapshot loading and transaction log writes, which in turn affects client response times.
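The znode arithmetic above can be checked with a back-of-envelope estimate. The two-znodes-per-partition count and the ~150-byte average znode size below are rough assumptions chosen to reproduce the article's figures, not measured values:

```python
# Back-of-envelope estimate of Zookeeper znode count and snapshot size for a
# large Kafka cluster. ZNODES_PER_PARTITION (partition znode plus its "state"
# child) and AVG_ZNODE_BYTES are assumed, illustrative values.

TOPICS = 10_000
PARTITIONS_PER_TOPIC = 100
ZNODES_PER_PARTITION = 2   # /partitions/<p> and /partitions/<p>/state
AVG_ZNODE_BYTES = 150      # assumed average serialized znode size

partitions = TOPICS * PARTITIONS_PER_TOPIC
znodes = partitions * ZNODES_PER_PARTITION
snapshot_mb = znodes * AVG_ZNODE_BYTES / 1024 / 1024

print(f"partitions: {partitions:,}")         # 1,000,000
print(f"znodes:     {znodes:,}")             # 2,000,000
print(f"snapshot:   ~{snapshot_mb:.0f} MB")  # ~286 MB, in line with ~300 MB
```

The point of the estimate is that snapshot size grows linearly with partition count, so doubling partitions roughly doubles snapshot load time as well.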
2. Partition replication : Replication threads handle many partitions; as partition count grows, disk I/O becomes fragmented, replication may lag, and ISR fluctuations increase. Adding replication threads reduces per‑thread load but intensifies disk contention.
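The replication-thread trade-off can be illustrated with a small sketch. Kafka spreads partitions over its replica fetcher threads by hashing; the `fetcher_for` function below is a simplified stand-in for that placement, not Kafka's exact formula:

```python
# Sketch of how partitions spread across replica fetcher threads, to show the
# trade-off: more threads means fewer partitions per thread, but each thread
# issues its own disk and network I/O, so contention grows with thread count.

from collections import Counter

def fetcher_for(topic: str, partition: int, num_fetchers: int) -> int:
    # Simplified stand-in for Kafka's hash-mod fetcher assignment.
    return (hash(topic) + partition) % num_fetchers

partitions = [("orders", p) for p in range(1000)]

for num_fetchers in (1, 4, 16):
    load = Counter(fetcher_for(t, p, num_fetchers) for t, p in partitions)
    print(num_fetchers, "threads -> max partitions per thread:",
          max(load.values()))
```

With 1,000 partitions, one thread carries all 1,000, while 16 threads carry at most 63 each; the per-thread load shrinks, but sixteen concurrent fetchers now compete for the same disks.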
3. Controller switch latency : Failures trigger controller switches that must reload metadata from ZK and propagate it to all brokers; restoring 1 million partitions can take ~37 s for ZK metadata retrieval and ~40 s for broker deserialization, risking prolonged unavailability.
4. Broker restart/recovery time : Broker restarts require leader handoff and ZK metadata updates; more partitions lengthen leader switch times, further extending downtime.
These factors show that increasing partition count directly impacts controller recovery, disk performance, and broker restart latency.
Solutions
1. Single ZK cluster optimizations
Parallel ZK node fetching: Use multiple threads to retrieve partition metadata from ZK concurrently; in tests this reduced the controller's ZK metadata retrieval from ~37 s to ~14 s.
Change metadata synchronization method: Store metadata in an internal Kafka topic instead of ZK, allowing brokers to consume updates via messages, though this requires significant architectural changes.
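The parallel-fetch idea can be sketched as follows. `zk_read` here just simulates one ZK round trip with a sleep; a real controller would issue the reads through its ZK client, but the speedup mechanism, overlapping many independent reads, is the same:

```python
# Minimal sketch of parallel ZK metadata fetching: instead of reading
# partition state one znode at a time, fan the reads out over a thread pool.
# zk_read() is a simulated stand-in for a single Zookeeper round trip.

import time
from concurrent.futures import ThreadPoolExecutor

def zk_read(path: str) -> str:
    time.sleep(0.001)  # stand-in for one ZK round-trip latency
    return f"state-for-{path}"

paths = [f"/brokers/topics/t/partitions/{p}/state" for p in range(200)]

start = time.time()
serial = [zk_read(p) for p in paths]
serial_s = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    parallel = list(pool.map(zk_read, paths))
parallel_s = time.time() - start

assert serial == parallel  # same results, much less wall-clock time
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Because each read is independent, wall-clock time drops roughly in proportion to the worker count until the ZK ensemble itself becomes the bottleneck, which matches the ~37 s to ~14 s improvement reported above.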
2. Logical cluster composition
Assemble multiple physical clusters into a logical cluster by designating a primary cluster to serve metadata, consumer offsets, and transaction coordination, while other clusters handle only business messages. The primary cluster aggregates metadata from all clusters and presents a unified view to clients, minimizing impact on production paths.
Key steps include:
Metadata service: Primary cluster fetches metadata from secondary clusters, merges it, and returns to clients.
Consumer group and transaction coordination: Centralize offset and transaction services in the primary cluster to avoid cross‑cluster inconsistencies.
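The metadata-service step above can be sketched as a simple merge. `fetch_metadata` is a hypothetical stand-in for a Metadata request against each physical cluster's brokers; the cluster names, topics, and broker IDs are illustrative:

```python
# Hedged sketch of the primary cluster's metadata aggregation: pull topic
# metadata from each physical cluster and return one merged view, so clients
# see a single logical cluster.

def fetch_metadata(cluster: str) -> dict:
    # Hypothetical per-cluster metadata: topic -> list of (partition, leader).
    sample = {
        "cluster-a": {"orders": [(0, "a-broker-1"), (1, "a-broker-2")]},
        "cluster-b": {"payments": [(0, "b-broker-1")]},
    }
    return sample[cluster]

def merged_view(clusters):
    view = {}
    for c in clusters:
        for topic, parts in fetch_metadata(c).items():
            # Topics are assumed unique per physical cluster; a real
            # implementation must reject or namespace duplicates.
            view[topic] = parts
    return view

print(merged_view(["cluster-a", "cluster-b"]))
```

Because only the metadata path is aggregated, produce and consume traffic still flows directly to the physical cluster that owns each topic, which is why the production path is largely unaffected.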
3. Physical cluster splitting
When a single physical cluster becomes too large, split it: divide the brokers into two groups, attach one group to observer nodes added to the existing Zookeeper ensemble, use Kafka's built-in partition reassignment tool (kafka-reassign-partitions.sh) to move topics onto the appropriate broker group, and finally promote the observers into an independent ZK ensemble for the split-off cluster.
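The topic moves in a split are driven by a reassignment plan in the JSON format that kafka-reassign-partitions.sh consumes. The sketch below generates such a plan; the topic name, partition count, broker IDs, and round-robin replica placement are illustrative choices, not a prescribed layout:

```python
# Sketch of generating a reassignment plan in the JSON format consumed by
# Kafka's kafka-reassign-partitions.sh, moving every partition of a topic
# onto the broker group that will belong to the split-off cluster.

import json

def reassignment_plan(topic, num_partitions, target_brokers, replication=2):
    plan = {"version": 1, "partitions": []}
    for p in range(num_partitions):
        # Spread replicas round-robin over the target broker group.
        replicas = [target_brokers[(p + i) % len(target_brokers)]
                    for i in range(replication)]
        plan["partitions"].append(
            {"topic": topic, "partition": p, "replicas": replicas})
    return json.dumps(plan, indent=2)

print(reassignment_plan("orders", 3, target_brokers=[101, 102, 103]))
```

The resulting file would be passed to the tool with `--execute`; once all topics live entirely on one broker group or the other, the ZK ensembles can be separated without moving any further data.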
Conclusion
Improving controller performance and assembling multiple physical clusters into a logical cluster both increase the partition capacity of a single logical cluster. Compared to extensive architectural changes, logical clustering offers smaller impact on existing Kafka setups and provides more reliable fault‑recovery times.
Ultimately, maintaining a moderate partition count per cluster and using business‑level partitioning across multiple clusters remains a practical strategy.
Author: Ding Jun, Tencent Cloud CKafka lead, with extensive experience in messaging, caching, and NoSQL infrastructure.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn