
Kafka Load Balancing and Cruise Control: Concepts, Manual Migration, and Deployment

Kafka’s server‑side load imbalance stems from static replica placement on broker disks. Correcting it means migrating replicas, which is infeasible to do by hand at scale. Cruise Control automates metric collection, analysis, and the execution of fine‑grained rebalance plans, including broker decommissioning and leader dispersion, so that large clusters can expand and operate efficiently.

vivo Internet Technology

Replica migration is the most frequent operation in running Kafka. In clusters with hundreds of thousands of replicas, migrating them by hand is impractical. Cruise Control, an operations tool for Kafka, supports bringing brokers online and offline, cluster‑wide load balancing, replica scaling, repair of missing replicas, and broker decommissioning, which makes large‑scale Kafka operations much easier.

1. Kafka Load Balancing

1.1 Producer load balancing – the client uses a partitioner that either assigns partitions round‑robin when no key is provided or uses a murmur2 hash of the key to select a partition. This client‑side balancing is not the focus of the article.
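The two producer-side rules described above can be illustrated with a small Python sketch. It is not the Kafka client's partitioner: `SketchPartitioner` is a hypothetical name, and CRC32 is used as a stand-in hash where Kafka uses murmur2, so the partition numbers it produces will not match a real producer's.

```python
import itertools
import zlib
from typing import Optional


class SketchPartitioner:
    """Toy partitioner: round-robin without a key, hash-based with one."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: cycle through the partitions in order.
            return next(self._counter) % self.num_partitions
        # Key present: stand-in hash (CRC32), NOT Kafka's murmur2.
        # Masking keeps the value non-negative before the modulo.
        return (zlib.crc32(key) & 0x7FFFFFFF) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
print([partitioner.partition(None) for _ in range(4)])  # [0, 1, 2, 0]
```

Because the keyed branch is deterministic, every record with the same key lands on the same partition, which is what preserves per-key ordering.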

1.2 Consumer load balancing – when consumers join or leave, or when topic partition counts change, the KafkaConsumer interacts with the broker to trigger partition re‑assignment, ensuring a more even consumption pattern.

The built‑in partition assignment strategies include range (a contiguous block of partitions per consumer) and round‑robin. Since Kafka 0.11.0.0 a StickyAssignor is also available, which tries to keep existing assignments while still balancing load.
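The difference between the range and round-robin strategies is easiest to see on a single topic. The sketch below is a simplified illustration, assuming one topic and consumers that all subscribe to it; the function names are hypothetical and the real assignors also handle multiple topics and differing subscriptions.

```python
from typing import Dict, List


def range_assign(consumers: List[str], num_partitions: int) -> Dict[str, List[int]]:
    """Give each consumer a contiguous block of the topic's partitions."""
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment: Dict[str, List[int]] = {}
    start = 0
    for index, member in enumerate(members):
        # The first `extra` consumers receive one additional partition.
        size = base + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + size))
        start += size
    return assignment


def round_robin_assign(consumers: List[str], num_partitions: int) -> Dict[str, List[int]]:
    """Deal partitions out to the consumers one at a time, in order."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for partition in range(num_partitions):
        assignment[members[partition % len(members)]].append(partition)
    return assignment


print(range_assign(["c0", "c1"], 5))        # {'c0': [0, 1, 2], 'c1': [3, 4]}
print(round_robin_assign(["c0", "c1"], 5))  # {'c0': [0, 2, 4], 'c1': [1, 3]}
```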

However, the real load‑imbalance problem lies on the server side, not the client side.

2. Why Server‑Side Load Balancing Is Needed

Traffic distribution across brokers is often uneven. Adding a new broker does not automatically shift traffic to it, leading to hot spots on a few brokers. The article shows two figures (traffic before and after adding a broker) illustrating this issue.

The root cause is Kafka’s storage mechanism. Each broker has multiple log directories; each topic‑partition replica lives in a directory under a log directory. Replicas are bound to specific disks, and without manual intervention the placement never changes, causing persistent imbalance as topics and partitions grow.

When the number of topics and partitions grows (e.g., 7,000 topics, 130,000 partitions, 270,000 replicas), some brokers become overloaded while others stay idle. Scaling the cluster by adding brokers does not relieve the hot brokers unless replicas are migrated.
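A minimal sketch of why expansion alone does not help: counting replicas per broker from a partition-to-replicas map shows that a newly added broker holds nothing until replicas are moved onto it. The assignment data and the function name are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def replicas_per_broker(
    assignment: Dict[Tuple[str, int], List[int]],
    broker_ids: Iterable[int],
) -> Dict[int, int]:
    """Count how many replicas each broker hosts, including empty brokers."""
    counts = Counter({broker: 0 for broker in broker_ids})
    for replicas in assignment.values():
        counts.update(replicas)
    return dict(counts)


# Brokers 0-2 host every replica; broker 3 has just been added.
assignment = {
    ("T0", 0): [0, 1],
    ("T0", 1): [1, 2],
    ("T0", 2): [2, 0],
}
counts = replicas_per_broker(assignment, broker_ids=[0, 1, 2, 3])
print(counts)  # {0: 2, 1: 2, 2: 2, 3: 0}
```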

3. Manual Replica Migration

The article presents a simple scenario with two topics (T0 and T1) and shows how specific partitions can be reassigned to a newly added broker (broker3). After moving partitions T0‑P2‑R0, T0‑P3‑R1, T0‑P4‑R0, and T1‑P0‑R1, the cluster becomes more balanced. Leader switches are also demonstrated to further improve balance.

# Replica migration script: kafka-reassign-partitions.sh
# 1. Create the reassignment file.
#    "replicas" must list numeric broker IDs; 0-3 below stand for broker0-broker3.
$ vi topic-reassignment.json
{
  "version":1,
  "partitions":[
    {"topic":"T0","partition":2,"replicas":[3,1]},
    {"topic":"T0","partition":3,"replicas":[0,3]},
    {"topic":"T0","partition":4,"replicas":[3,1]},
    {"topic":"T1","partition":0,"replicas":[2,3]},
    {"topic":"T1","partition":2,"replicas":[2,0]}
  ]
}
# 2. Run the migration (--throttle is in bytes/s; 73400320 = 70 MiB/s).
#    Newer Kafka releases replace --zookeeper with --bootstrap-server.
bin/kafka-reassign-partitions.sh --throttle 73400320 --zookeeper zkurl --execute --reassignment-json-file topic-reassignment.json
# 3. Check migration status; once the reassignment has completed, --verify also clears the throttle
bin/kafka-reassign-partitions.sh --zookeeper zkurl --verify --reassignment-json-file topic-reassignment.json
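Writing the reassignment file by hand becomes error-prone beyond a few partitions. The following sketch generates the same JSON structure from a list of moves. The helper name is hypothetical, and it assumes that broker0 to broker3 have the numeric IDs 0 to 3, since the tool expects integer broker IDs in `replicas`.

```python
import json
from typing import Dict, Iterable, List, Tuple

# (topic, partition, target replica broker IDs)
Move = Tuple[str, int, List[int]]


def build_reassignment(moves: Iterable[Move]) -> Dict:
    """Build the version-1 reassignment document from a list of moves."""
    partitions = []
    for topic, partition, replicas in moves:
        if not all(isinstance(broker, int) for broker in replicas):
            raise TypeError(f"{topic}-{partition}: broker IDs must be integers")
        if len(set(replicas)) != len(replicas):
            raise ValueError(f"{topic}-{partition}: duplicate broker in {replicas}")
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
        )
    return {"version": 1, "partitions": partitions}


# The four moves onto broker 3 described in the scenario above.
moves = [
    ("T0", 2, [3, 1]),
    ("T0", 3, [0, 3]),
    ("T0", 4, [3, 1]),
    ("T1", 0, [2, 3]),
]
plan = build_reassignment(moves)
print(json.dumps(plan, indent=2))
```

The printed document can be saved as `topic-reassignment.json` and passed to `--reassignment-json-file`.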

4. Cruise Control – Automated Load Balancing

Cruise Control, developed by LinkedIn, automates the analysis and execution of load‑balancing plans. Its architecture consists of four components:

Monitor: collects raw Kafka metrics via a MetricsReporter, stores them in internal topics, and aggregates them per broker and partition.

Analyzer: generates migration plans based on user‑defined goals (hard goals such as rack‑aware placement, soft goals like CPU or network usage).

Executor: submits the migration plan to Kafka in batches, performing replica moves and leader switches.

Anomaly Detector: periodically checks for imbalance or missing replicas and triggers a rebalance when needed.
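To give a feel for what the Analyzer produces, here is a toy greedy planner. It is not Cruise Control's algorithm: it treats "at most one replica of a partition per broker" as the only hard constraint and "narrow the load gap between the busiest and idlest broker" as the only soft goal. All names and load figures are invented for illustration.

```python
from typing import Dict, List, Tuple

Replica = Tuple[str, int]        # (topic, partition)
Move = Tuple[Replica, int, int]  # (replica, source broker, destination broker)


def propose_moves(
    placement: Dict[int, Dict[Replica, float]], max_moves: int = 10
) -> List[Move]:
    """Greedily move replicas from the most- to the least-loaded broker.

    `placement` maps broker id -> {replica: load} and is updated in place.
    """
    moves: List[Move] = []
    for _ in range(max_moves):
        load = {broker: sum(reps.values()) for broker, reps in placement.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Hard constraint: dst must not already hold this partition.
        # Soft goal: the move must strictly narrow the gap (0 < size < gap).
        candidates = [
            (rep, size)
            for rep, size in placement[src].items()
            if rep not in placement[dst] and 0 < size < gap
        ]
        if not candidates:
            break
        # Prefer the replica whose size is closest to half the gap.
        rep, _size = min(candidates, key=lambda c: abs(gap - 2 * c[1]))
        placement[dst][rep] = placement[src].pop(rep)
        moves.append((rep, src, dst))
    return moves


placement = {
    0: {("T0", 0): 40, ("T0", 1): 30, ("T1", 0): 30},
    1: {("T0", 2): 20},
    2: {},  # newly added, empty broker
}
for rep, src, dst in propose_moves(placement):
    print(f"move {rep} from broker {src} to broker {dst}")
```

In this example the planner stops after two moves, leaving broker loads of 30, 50, and 40. A real analyzer weighs several resources at once and many more goals.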

The article details several enhancements made for the author’s environment:

Balancing specific resource groups instead of the whole cluster.

Balancing only selected topics or partitions.

Adding a new goal – “topic‑partition leader replica dispersion” – to ensure leaders are evenly spread.

These improvements allow fine‑grained control, reducing rebalance time from weeks to minutes and preventing cross‑resource‑group interference.
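The article does not describe how the leader dispersion goal is implemented. As a rough sketch of the condition such a goal has to check, the code below counts a topic's partition leaders per broker and reports the brokers holding more than an even share. The function name and the sample data are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def leader_excess(leaders: Dict[int, int], broker_ids: List[int]) -> Dict[int, int]:
    """Return broker -> number of leaders above the even-share ceiling.

    `leaders` maps partition -> leader broker id for ONE topic.
    """
    ceiling = math.ceil(len(leaders) / len(broker_ids))
    counts = Counter(leaders.values())
    return {
        broker: counts[broker] - ceiling
        for broker in broker_ids
        if counts[broker] > ceiling
    }


# Six partitions, three brokers: an even spread is two leaders per broker.
leaders = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 0}
print(leader_excess(leaders, broker_ids=[0, 1, 2]))  # {0: 2}
```

An empty result means the topic's leaders are already evenly dispersed. A non-empty one identifies the brokers from which leadership should be moved.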

5. Deployment Steps

5.1 Client side (the Kafka brokers) – metric collection:

Create a Kafka user for metric production/consumption.

Create three internal topics for raw JMX metrics and processed Cruise Control metrics.

Grant the user read/write permissions on those topics.

Modify server.properties to add the Cruise Control metrics reporter.

# Edit Kafka's server.properties
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
cruise.control.metrics.reporter.bootstrap.servers=HOSTNAME:9092
cruise.control.metrics.reporter.security.protocol=SASL_PLAINTEXT
cruise.control.metrics.reporter.sasl.mechanism=SCRAM-SHA-256
cruise.control.metrics.reporter.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="ys" password="ys";

5.2 Server side – deploying the Cruise Control service:

Download the Cruise Control zip from GitHub and replace the built JARs with the customized ones.

Configure security (SASL, SCRAM), bootstrap servers, and Zookeeper connection.

Set database connection parameters for storing cluster metadata.

# Edit the Cruise Control configuration file
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="ys" password="ys";
bootstrap.servers=HOSTNAME:9092
zookeeper.connect=zkURL
# Database connection settings
cluster_id=xxx
db_url=jdbc:mysql://hostxxxx:3306/databasexxx
db_user=xxx
db_pwd=xxx

After configuring, restart Kafka and start Cruise Control. The tool will continuously monitor metrics, detect imbalances, and automatically generate and execute migration plans.
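Once running, upstream Cruise Control is driven through a REST API under `/kafkacruisecontrol/`. The sketch below only builds the method and URL for a few common endpoints and sends nothing. The host name is hypothetical, port 9090 is the upstream default, and the customized build described in the article may expose different endpoints or parameters.

```python
from typing import Tuple
from urllib.parse import urlencode

# Hypothetical host; 9090 is the upstream default web server port.
BASE_URL = "http://cruise-control.example:9090/kafkacruisecontrol"

# Read-only endpoints use GET; endpoints that can change the cluster use POST.
READ_ONLY_ENDPOINTS = {"state", "load", "proposals"}


def cc_request(endpoint: str, **params: object) -> Tuple[str, str]:
    """Return the (HTTP method, URL) pair for a Cruise Control endpoint."""
    method = "GET" if endpoint in READ_ONLY_ENDPOINTS else "POST"
    # Render Python booleans as lowercase true/false in the query string.
    query = urlencode(
        {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
    )
    url = f"{BASE_URL}/{endpoint}" + (f"?{query}" if query else "")
    return method, url


print(cc_request("state"))
print(cc_request("rebalance", dryrun=True))
print(cc_request("remove_broker", brokerid=3, dryrun=True))
```

Keeping `dryrun=true` returns the proposed plan without executing it, which is a sensible first step before letting the Executor move replicas on a production cluster.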

6. Conclusion

The article highlights two major drawbacks of Kafka:

Each partition replica is tied to a specific disk, leading to high disk pressure.

Cluster expansion requires careful rebalancing to avoid overload on existing brokers.

Cruise Control addresses these operational challenges by providing automated, resource‑aware load balancing, leader switching, and topic configuration changes, making large‑scale Kafka clusters manageable.
