Fast Kafka Cluster Expansion: Practical Strategies to Reduce Data Migration
When a Kafka cluster reaches its load limits or takes a sudden traffic spike, urgent expansion is needed, but data migration can be time-consuming and risky. This guide outlines several practical techniques, including adjusting retention, adding partitions, switching leaders, and running single-replica partitions, to scale clusters quickly while minimizing data movement and service disruption.
Kafka clusters hit bottlenecks when node load climbs or traffic spikes suddenly, forcing rapid expansion. Newly added nodes must receive data through migration, which can be slow and adds pressure to the existing nodes, potentially causing failures.
What is Data Migration?
Kafka calls it "partition reassignment" and provides the kafka-reassign-partitions.sh script. The process involves three steps:
Copy partitions from old nodes to new nodes via replica replication.
Switch the leader to the new node.
Delete the old partitions.
Strictly speaking, data migration is only the copy step, but this article uses the term for the whole process. A minimal invocation of the reassignment script is sketched below.
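As a hedged, minimal sketch of that flow (topic name, broker IDs, and the throttle value are placeholders; Kafka 2.4+ syntax with --bootstrap-server, older releases use --zookeeper):

```bash
# reassign.json: move partition 0 of "hot-topic" onto brokers 5 and 6 (placeholders).
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "hot-topic", "partition": 0, "replicas": [5, 6]}
  ]
}
EOF

# Start the reassignment; --throttle caps replication traffic (bytes/sec)
# so the copy does not overwhelm the source brokers.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute --throttle 50000000

# Poll until the reassignment completes; a successful --verify also removes the throttle.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --verify
```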
When Is Expansion Needed?
All nodes are heavily loaded and require quick scaling.
Only a few nodes are overloaded and need load reduction.
Typical pressure metrics are disk utilization, CPU, and network bandwidth. While adding nodes can relieve pressure, fine‑grained operations (parameter tuning, removing faulty nodes/disks) may also help.
Identify Top‑Traffic Topics
Most load usually comes from a small set of topics (Pareto principle). Visualizations of real‑world clusters show traffic concentrated in a few top topics. Reducing pressure on these topics yields the greatest impact.
Methods to find top topics include:
Query broker‑exposed JMX metrics and sort by inbound traffic.
Run a shell script using cmdline-jmxclient.jar to fetch and sort topic traffic.
Inspect the data directory on brokers (e.g., ll -h /data/kafka_data/) to gauge partition size. Sketches of the JMX and disk-usage approaches follow this list.
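For example (paths, hostnames, and the jar version are placeholders; the JMX bean is Kafka's standard BrokerTopicMetrics):

```bash
# Rank topic-partition directories by on-disk size on a broker.
du -sh /data/kafka_data/*/ 2>/dev/null | sort -rh | head -20

# Read one topic's inbound byte rate over JMX (the broker must be started
# with a JMX port, e.g. JMX_PORT=9999; "-" means no JMX auth).
java -jar cmdline-jmxclient-0.10.3.jar - broker-host:9999 \
  'kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=hot-topic' \
  OneMinuteRate
```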
Solution Overview: Minimize Data Migration
The core idea of the following solutions is to move as little data as possible, or avoid migration altogether, while still balancing load.
Solution 1 – Reduce Retention Time, Then Migrate Precisely
Steps:
Identify target topics.
Adjust their retention time (e.g., to 1 hour); a sketch of the config change follows these steps.
Migrate the selected topics.
Wait for migration to finish, then switch leaders to evenly distribute load.
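A hedged sketch of the retention change and its later rollback (topic name and values are placeholders); retention.ms is a dynamic per-topic config, so no broker restart is needed:

```bash
# Shrink retention to 1 hour so old segments are deleted before the
# copy starts, minimizing the data that must be migrated.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name hot-topic \
  --alter --add-config retention.ms=3600000

# ...run the (ideally throttled) reassignment as sketched earlier...

# Afterwards, remove the override so the topic falls back to the
# cluster-default retention.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name hot-topic \
  --alter --delete-config retention.ms
```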
Pros: Simple, works for both cluster‑wide and node‑specific overload.
Cons:
If business cannot delete data, shortening retention is not viable.
Even with short retention, the migration itself adds load to the source node, potentially slowing down the cluster.
Solution 2 – Add Partitions on Specific Nodes
When retention‑based migration is unsuitable, increase the partition count of target topics and assign the new partitions to the newly added nodes. The steps are:
Identify target topics.
Estimate how much traffic to shift to the new node (e.g., 30‑50%).
Increase the topic's partition count (e.g., from 100 to 200) and direct the new partitions to the new node (one way is sketched below).
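One hedged way with stock tooling (topic name and broker ID 6 are placeholders): grow the partition count, then immediately pin the new, still-empty partitions to the new broker via reassignment. Because they hold no data yet, the move copies nothing; run it promptly, before producer metadata refreshes and writes begin landing on the new partitions.

```bash
# Grow the topic from 100 to 200 partitions.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic hot-topic --partitions 200

# Build a reassignment that pins partitions 100..199 to the new broker
# (id 6; single replica here for brevity, list more IDs to keep the
# topic's original replication factor).
{
  echo '{"version":1,"partitions":['
  for p in $(seq 100 199); do
    sep=','
    [ "$p" -eq 199 ] && sep=''
    echo "  {\"topic\":\"hot-topic\",\"partition\":$p,\"replicas\":[6]}$sep"
  done
  echo ']}'
} > new-partitions.json

# Near-instant, since the new partitions are still empty.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file new-partitions.json --execute
```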
Pros: No data copy needed; client metadata refreshes within seconds, and producers automatically start sending to new partitions.
Cons:
If the business requires a fixed partition count (e.g., key-hash-based ordering), this method cannot be used.
Repeatedly expanding partitions can cause partition count explosion.
If clients write to a subset of partitions, the load shift may be ineffective.
Solution 3 – Leader Switching to Reduce Single‑Node Load
Identify leaders on overloaded nodes and move them to followers on low-load nodes. This completes in seconds because only a leader election is involved and no data is copied.
Find all leaders on the high‑load node.
Switch each leader to a follower on a less-loaded node (a sketch follows).
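Kafka elects the first replica in a partition's replica list as the preferred leader, so a hedged two-step sketch (topic, partition, and broker IDs are placeholders; kafka-leader-election.sh ships with Kafka 2.4+, older releases use kafka-preferred-replica-election.sh):

```bash
# Reorder the replica list so broker 3 (the low-load follower) comes
# first. The replica set itself is unchanged, so no data is copied.
cat > reorder.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "hot-topic", "partition": 7, "replicas": [3, 1]}
  ]
}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reorder.json --execute

# Elect the preferred (first-listed) replica as the new leader.
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --topic hot-topic --partition 7
```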
Pros: Immediate load reduction without data movement.
Cons:
If the follower is already heavily loaded, the switch may worsen overall pressure.
High leader load can cause replicas to fall out of the ISR; switching the leader in that state risks data loss.
Solution 4 – Run Single‑Replica Partitions
When leader switching cannot relieve the pressure, temporarily remove followers from overloaded partitions, making them single-replica. This frees the CPU, network, and disk resources consumed by follower sync.
Identify followers on the overloaded node.
Stop those follower processes (a controlled alternative is sketched below).
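A more controlled alternative to killing processes (a hedged sketch; topic, partition, and broker IDs are placeholders) is to shrink the partition's replica list to just the current leader via reassignment, which removes the follower and its sync traffic:

```bash
# Shrink partition 7 of hot-topic to a single replica, keeping only the
# current leader (broker 3 here) so follower sync load disappears.
cat > shrink.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "hot-topic", "partition": 7, "replicas": [3]}
  ]
}
EOF
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file shrink.json --execute

# Once load stabilizes, re-run with replicas [3, 1] (or similar) to
# restore redundancy; that re-replication is the deferred migration cost.
```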
Cons: No redundancy; if the node crashes, data loss or service interruption occurs. The operation must be reversed after load stabilizes.
Solution 5 – Other Approaches
Delete and recreate high‑traffic topics after adding new nodes (fast but loses unconsumed data).
Vertical scaling (upgrade VM specs) – watch for unsynced replicas and the unclean.leader.election.enable setting (see the sketch after this list).
Client‑side partition routing – requires producers to explicitly choose low‑load partitions.
Sync only the latest data, discarding old data during migration – reduces unnecessary copy but may cause data loss if not handled carefully.
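For the vertical-scaling caveat above, a hedged sketch (topic name is a placeholder) of checking and pinning unclean.leader.election.enable, so that an out-of-sync replica cannot be elected leader and silently lose data during restarts:

```bash
# Show any per-topic overrides currently in effect.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name hot-topic --describe

# Explicitly disable unclean leader election for the topic.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name hot-topic \
  --alter --add-config unclean.leader.election.enable=false
```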
Architectural Insight
Kafka’s tight coupling of partitions to nodes makes scaling cumbersome. In contrast, Apache Pulsar separates compute and storage and stores partitions as multiple segments via BookKeeper, allowing more flexible scaling without massive data migration.
Conclusion
This article has enumerated fast-expansion techniques that minimize data movement and quickly relieve pressure. Choose the method that matches your business tolerance for data loss, service interruption, and operational complexity.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.