Mastering Kafka Load Balancing with Cruise Control: From Manual Migration to Automated Optimization

This article explains why Kafka suffers from broker‑side load imbalance, walks through a manual replica migration example, and then details how Cruise Control automates load balancing, how it can be extended with resource‑group targeting and leader‑replica dispersion, and how to deploy it step by step.

ITPUB

Server‑side Load Imbalance in Kafka

Kafka brokers store each [topic]-[partition] replica in a directory under one of the configured log.dirs. Each directory contains .index, .timeindex, .snapshot and .log files. Once a replica is assigned to a broker and log directory, it stays there until it is explicitly reassigned. As a cluster grows (hundreds of thousands of replicas, thousands of topics), this static placement leads to uneven broker load: some brokers host many hot partitions while others remain under‑utilised. Adding a new broker does not automatically shift traffic, because existing partitions keep their current replica assignment; only newly created partitions can land on the new broker.
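
To make the imbalance concrete, the following minimal sketch counts replicas and leaders per broker and reports how far the busiest broker sits above the average. The assignment data, the helper names, and the choice of "max divided by mean" as a metric are all illustrative assumptions, not something Kafka or Cruise Control defines.

```python
from collections import Counter


def broker_counts(assignment):
    """Count replicas and preferred leaders per broker.

    `assignment` maps (topic, partition) -> ordered list of broker IDs;
    the first entry is treated as the preferred leader.
    """
    replicas = Counter()
    leaders = Counter()
    for replica_list in assignment.values():
        replicas.update(replica_list)
        leaders[replica_list[0]] += 1
    return replicas, leaders


def imbalance_ratio(counts, brokers):
    """Max count divided by mean count over `brokers` (1.0 means perfectly even)."""
    values = [counts.get(b, 0) for b in brokers]
    mean = sum(values) / len(values)
    return max(values) / mean if mean else 0.0


# Hypothetical layout: broker 3 was just added and hosts nothing yet.
assignment = {
    ("T0", 0): [0, 1],
    ("T0", 1): [1, 2],
    ("T0", 2): [2, 0],
    ("T1", 0): [0, 2],
}
replicas, leaders = broker_counts(assignment)
print(dict(replicas), dict(leaders))
print(imbalance_ratio(replicas, brokers=[0, 1, 2, 3]))
```

A ratio well above 1.0 for replicas or leaders is a rough signal that some brokers carry more than their share, although replica counts alone say nothing about how hot each partition is.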

Kafka storage hierarchy

Manual Replica Migration Example

Assume two topics: T0 (5 partitions, 2 replicas) and T1 (3 partitions, 2 replicas). After adding a new broker broker3 in a second rack, the following replica‑reassignment moves partitions from overloaded brokers to the new broker and also switches some leaders to improve balance.

Manual migration plan
# Replica migration with kafka-reassign-partitions.sh
# 1. Create the migration file (topic-reassignment.json).
#    "replicas" must list numeric broker IDs; this assumes broker0..broker3 have IDs 0..3.
#    The first ID in each list is the preferred leader.
{
  "version": 1,
  "partitions": [
    {"topic": "T0", "partition": 2, "replicas": [3, 1]},
    {"topic": "T0", "partition": 3, "replicas": [0, 3]},
    {"topic": "T0", "partition": 4, "replicas": [3, 1]},
    {"topic": "T1", "partition": 0, "replicas": [2, 3]},
    {"topic": "T1", "partition": 2, "replicas": [2, 0]}
  ]
}
# 2. Execute the migration (throttle 70 MiB/s = 73400320 bytes/s)
bin/kafka-reassign-partitions.sh --throttle 73400320 \
    --zookeeper zkurl --execute \
    --reassignment-json-file topic-reassignment.json
# 3. Verify progress; once the reassignment has completed, --verify also removes the throttle
bin/kafka-reassign-partitions.sh --zookeeper zkurl \
    --verify --reassignment-json-file topic-reassignment.json
# Note: --zookeeper was removed from this tool in Kafka 3.0; newer releases take --bootstrap-server instead.
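
Writing the reassignment file by hand is error‑prone, so a small generator can help. The sketch below builds the same JSON structure from a list of moves and converts a throttle in MiB/s to the bytes‑per‑second value the tool expects. It assumes broker0..broker3 map to numeric IDs 0..3; the function names are hypothetical.

```python
import json

MIB = 1024 * 1024


def build_reassignment(moves):
    """Build the structure expected by --reassignment-json-file.

    `moves` is a list of (topic, partition, replica_broker_ids) tuples.
    """
    partitions = []
    for topic, partition, replicas in moves:
        # A partition cannot have two replicas on the same broker.
        if len(set(replicas)) != len(replicas):
            raise ValueError(f"duplicate broker in {topic}-{partition}: {replicas}")
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
        )
    return {"version": 1, "partitions": partitions}


def throttle_bytes_per_sec(mib_per_sec):
    """Convert MiB/s to the bytes/s value passed to --throttle."""
    return int(mib_per_sec * MIB)


# The same five moves as in the manual plan, with assumed numeric broker IDs.
moves = [
    ("T0", 2, [3, 1]),
    ("T0", 3, [0, 3]),
    ("T0", 4, [3, 1]),
    ("T1", 0, [2, 3]),
    ("T1", 2, [2, 0]),
]
plan = build_reassignment(moves)
print(json.dumps(plan, indent=2))
print(throttle_bytes_per_sec(70))
```

The printed JSON can be saved as topic-reassignment.json and passed to the tool unchanged.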

Cruise Control Overview

Cruise Control (developed by LinkedIn) automates Kafka cluster operations such as broker on/off‑line, intra‑cluster load balancing, replica expansion/reduction, missing‑replica repair, and broker de‑commissioning. It consists of four core components:

Monitor: A Metrics Reporter pushes native Kafka metrics to the __CruiseControlMetrics topic; a Metrics Sampler aggregates them per broker and partition and stores the results in __KafkaCruiseControlModelTrainingSamples and __KafkaCruiseControlPartitionMetricSamples.

Analyzer: Generates migration plans based on hard goals (must‑satisfy, e.g., rack‑aware replica placement) and soft goals (preferable, e.g., balanced CPU, network, disk usage). It evaluates broker load models and selects partitions to move.

Executor: Submits the plan to Kafka in batches, using the same partition‑reassignment mechanism that kafka-reassign-partitions.sh drives.

Anomaly Detector: Periodically checks for imbalance or missing replicas and triggers a rebalance automatically.
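
The difference between hard and soft goals can be illustrated with a toy acceptance check for a single replica move. This is only a sketch of the idea, not Cruise Control's actual algorithm: the Broker type, the normalized load figure, and the "spread must shrink" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    broker_id: int
    rack: str
    load: float  # hypothetical normalized utilization, 0.0 to 1.0


def rack_aware(replica_brokers):
    """Hard goal: no two replicas of one partition share a rack."""
    racks = [b.rack for b in replica_brokers]
    return len(racks) == len(set(racks))


def load_spread(brokers):
    """Soft goal score: gap between most and least loaded broker (lower is better)."""
    loads = [b.load for b in brokers]
    return max(loads) - min(loads)


def accept_move(replicas_after, brokers_before, brokers_after):
    """Reject any move that breaks the hard goal; otherwise require a better soft score."""
    if not rack_aware(replicas_after):
        return False
    return load_spread(brokers_after) < load_spread(brokers_before)


# Hypothetical cluster: brokers 0 and 1 share rack "a", broker 3 is new in rack "b".
before = [Broker(0, "a", 0.9), Broker(1, "a", 0.5), Broker(3, "b", 0.1)]
after = [Broker(0, "a", 0.7), Broker(1, "a", 0.5), Broker(3, "b", 0.3)]

# Moving a replica from broker 0 to broker 3 leaves the partition on brokers 3 and 1.
print(accept_move([after[2], after[1]], before, after))
```

Here the move is accepted because the two replicas end up in different racks and the load gap narrows; a move that left both replicas in rack "a" would be rejected regardless of its effect on load.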

Cruise Control architecture

Enhancements for Large Clusters

Resource‑group aware balancing – operators can target a subset of brokers instead of the whole cluster.

Fine‑grained topic or partition migration – moves stay within specified brokers.

New hard goal “leader‑replica dispersion” – spreads each topic’s leader replicas evenly across the cluster.

These extensions reduce rebalance time from weeks to minutes and avoid cross‑resource‑group interference.
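
The article does not show how these extensions are implemented, so the following is only a guess at the shape of the checks involved: one function tests whether a topic's leaders are spread evenly over the brokers of a resource group, and another restricts move destinations to that group. The function names, the tolerance parameter, and the broker IDs are hypothetical.

```python
from collections import Counter


def leaders_dispersed(topic_leaders, group_brokers, tolerance=1):
    """True if per-broker leader counts within the group differ by at most `tolerance`.

    `topic_leaders` lists the leader broker ID of each partition of one topic.
    """
    counts = Counter(topic_leaders)
    per_broker = [counts.get(b, 0) for b in group_brokers]
    return max(per_broker) - min(per_broker) <= tolerance


def candidate_destinations(group_brokers, current_replicas):
    """Brokers inside the resource group that do not already host the partition."""
    return [b for b in group_brokers if b not in current_replicas]


group = [0, 1, 2, 3]
print(leaders_dispersed([0, 0, 0, 1, 2], group))   # three leaders on broker 0, none on 3
print(leaders_dispersed([0, 1, 2, 3, 0], group))   # at most one extra leader anywhere
print(candidate_destinations(group, current_replicas=[0, 2]))
```

Limiting candidates to the group is what keeps a rebalance for one business segment from touching brokers that belong to another.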

Installation & Configuration

Client‑side metrics reporter (runs inside each Kafka broker; add to each broker’s server.properties)

# server.properties additions
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
cruise.control.metrics.reporter.bootstrap.servers=host:9092
cruise.control.metrics.reporter.security.protocol=SASL_PLAINTEXT
cruise.control.metrics.reporter.sasl.mechanism=SCRAM-SHA-256
cruise.control.metrics.reporter.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="ys" password="ys";

Copy the cruise-control-metrics-reporter-*.jar into each broker's libs directory and restart the brokers.
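
Before restarting a broker, it can be worth checking that the reporter settings actually made it into server.properties. The sketch below is a deliberately simple parser, assuming one key=value pair per line; it ignores other Java properties features such as ":" separators and line continuations. The set of required keys is an assumption based on the snippet above.

```python
REQUIRED_KEYS = {
    "metric.reporters",
    "cruise.control.metrics.reporter.bootstrap.servers",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as JAAS configs stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_reporter_keys(text):
    """Return the required reporter keys that are absent, sorted by name."""
    return sorted(REQUIRED_KEYS - set(parse_properties(text)))


sample = """
# server.properties additions
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
cruise.control.metrics.reporter.security.protocol=SASL_PLAINTEXT
"""
print(missing_reporter_keys(sample))
```

An empty list means both keys are present; in practice the text would come from reading the broker's real configuration file.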

Server‑side deployment

Download the Cruise Control source from https://github.com/linkedin/cruise-control and build the project.

If a custom build is required, replace the generated cruise-control-*.jar in cruise-control/build/libs with your version.

Edit cruise-control.properties to configure security, bootstrap servers, ZooKeeper, and the metadata database.

# cruise-control.properties excerpt
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="ys" password="ys";
bootstrap.servers=host:9092
zookeeper.connect=zkURL
cluster_id=xxx
db_url=jdbc:mysql://host:3306/databasexxx
db_user=xxx
db_pwd=xxx
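
Once the service is running, one way to confirm the deployment is to query its REST interface. The sketch below only builds request URLs and sends nothing. It assumes the upstream Cruise Control defaults (HTTP port 9090, the /kafkacruisecontrol/ prefix, and the state and rebalance endpoints with the dryrun and json parameters); a customized build such as the one configured above may differ, and cc-host is a placeholder host name.

```python
from urllib.parse import urlencode


def cruise_control_url(host, endpoint, port=9090, **params):
    """Build a Cruise Control REST URL; booleans are rendered as true/false."""
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    base = f"http://{host}:{port}/kafkacruisecontrol/{endpoint}"
    query = urlencode(normalized)
    return f"{base}?{query}" if query else base


# GET this URL to inspect the state of the service.
print(cruise_control_url("cc-host", "state"))

# POST to this URL to ask for a rebalance proposal without executing it.
print(cruise_control_url("cc-host", "rebalance", dryrun=True, json=True))
```

Starting with dryrun=true is a cautious default, because it lets an operator review the proposed moves before any data is transferred.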

Key Operational Concepts

Hard goals (must be satisfied) include rack‑aware replica placement and the new leader‑replica dispersion goal.

Soft goals (preferable) include balanced CPU, network throughput, disk usage, and inbound/outbound traffic.

The Analyzer evaluates each broker against the configured goal thresholds; if a broker exceeds a limit, replicas are selected for migration to under‑utilised brokers, subject to the hard goals.

The Executor applies the plan in batches using Kafka's standard partition‑reassignment mechanism, optionally throttling the data movement.
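
Batched execution with a throttle can be sketched in a few lines. The helper names and the batch size below are hypothetical, and the time estimate is a rough lower bound that assumes the throttle is the only limit on transfer speed.

```python
def batches(moves, max_concurrent):
    """Yield successive slices of `moves`, each holding at most `max_concurrent` items."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    for start in range(0, len(moves), max_concurrent):
        yield moves[start:start + max_concurrent]


def estimated_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Rough transfer time if the throttle is the only bottleneck."""
    return bytes_to_move / throttle_bytes_per_sec


# Hypothetical plan of five partition moves, executed two at a time.
plan = ["T0-2", "T0-3", "T0-4", "T1-0", "T1-2"]
for batch in batches(plan, max_concurrent=2):
    print(batch)

# 700 MiB at a 70 MiB/s throttle.
print(estimated_seconds(700 * 1024 * 1024, 70 * 1024 * 1024))
```

Smaller batches keep the extra replication traffic on any one broker low at the cost of a longer overall rebalance.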

Resulting Benefits

By automating replica and leader rebalancing, Cruise Control removes most of the need for hand‑written kafka-reassign-partitions.sh plans, shortens the time operators spend on rebalancing, and keeps CPU, network, and disk utilization balanced across large Kafka clusters. The resource‑group extensions allow operators to rebalance only the affected business segment, further shortening maintenance windows and improving overall cluster stability.
