Operations 21 min read

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

dbaplus Community
dbaplus Community
dbaplus Community
Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

Background

Kafka is the core data middleware in the company, supporting both big‑data and online scenarios. Over 1000 machines form more than 20 clusters, handling petabytes of data daily and serving as the backbone for data transmission.

Kafka cluster overview
Kafka cluster overview

Challenges

Client read/write patterns are highly diverse and unpredictable, causing I/O pressure and resource contention.

Multiple business workloads share clusters, leading to interference and a larger "explosion radius".

Open‑source rate‑limiting is coarse‑grained and cannot react to real‑time disk health.

Kafka broker add/remove processes are cumbersome and inefficient.

Partition allocation ignores disk load and topic traffic, resulting in load imbalance across machines and disks.

Lack of automatic balancing and migration‑rate control hampers real‑time read/write performance.

Scaling across multiple IDC locations introduces coordination challenges.

Only a single thread‑pool for brokers causes slow requests to block other traffic.

Solution – Guardian

We built “Guardian”, a self‑developed Kafka federation cluster controller. It uses Raft for high availability and collects metrics from Kafka servers to drive governance plans. Core functions include federation metadata management, remote‑storage metadata, UUID (topicId, segmentId) allocation, cluster scheduling, multi‑tenant label isolation, fault alerting and self‑healing.

Guardian architecture
Guardian architecture

Metrics collection via JMX is inefficient because each request fetches a single MBean; we therefore use a GRPC‑based Kafka Reporter that pulls all metrics in a single RPC.

1. Partition‑level throttling

Goal: keep disk I/O within safe limits while maximizing throughput. We estimate whether a read will hit disk by comparing available memory to read/write rates, then compute a time T that data can stay in PageCache. If a partition’s LEO - MessageInRate * T exceeds the requested offset, the read is served from cache; otherwise we throttle the disk.

Define six disk‑I/O behaviors: client read/write, replica sync read/write, migration read/write.

Identify abnormal behaviors (excessive writes, any reads) and rank them in an “abnormal behavior queue” by current traffic volume.

Guardian monitors these metrics, performs root‑cause analysis per partition, and applies throttling automatically.

Throttling workflow
Throttling workflow

Enabling partition‑level throttling triggers automatic disk‑migration tasks, which have significantly improved cluster stability.

2. Automatic partition balancing

We generate migration plans based on disk load, topic distribution, and per‑partition traffic. Plans select target disks with the lowest historical load median, schedule incremental migration to avoid long‑tail blocking, and dynamically adjust speed according to cluster load.

Support concurrent migrations across clusters.

Pre‑allocate partitions for new topics based on current disk load and expected traffic.

Leader balancing to avoid hotspot leaders.

Cancel migrations when nodes fail or topics need expansion.

Automatic balancing diagram
Automatic balancing diagram

3. Multi‑tenant resource isolation

Introduce exclusive resources per application domain. Topics can be created with a designated exclusive resource; the system allocates machines from a pool accordingly, supports dynamic scaling, and can shrink resources by migrating partitions away.

Multi‑tenant diagram
Multi‑tenant diagram

4. Multi‑IDC management

Support cross‑IDC topic migration with one‑click, configurable replica placement from cluster‑level down to topic‑level, and near‑by read routing so that requests are served by the closest replica.

IDC management diagram
IDC management diagram

5. Request‑queue splitting

Separate fast and slow request handling by assigning them to dedicated thread pools, reducing latency impact of slow partitions. Slow requests are routed to a special SlowRequest pool after being marked by ChannelMarker.

6. Tiered storage

Break the tight binding between partitions and local disks. Use HDFS as remote storage while keeping recent data on SSD. A Raft‑based meta service backed by RocksDB stores metadata (key: cluster_topic_uuid_id_partition_event_id, value: PB‑serialized bytes). The system fetches data from local cache when possible and falls back to remote storage for older segments, reducing HDFS load.

Batch meta updates with leader fencing for strong consistency.

Local cache of remote segments (one segment per read) to lower latency.

Reduced write RT jitter and lower write latency because full data no longer resides on SSD.

Tiered storage architecture
Tiered storage architecture

7. Audit and cost management

We added an audit feature that streams production and consumption details to ClickHouse in real time, enabling fine‑grained troubleshooting and a cost‑management system that automatically cleans up redundant topics, reducing waste and improving security.

Audit overview
Audit overview

Operational improvements

Automated smooth rolling upgrades: batch machine up/down without manual steps, cutting manpower from 15 people/day to 1 person/hour.

Automatic fault detection and self‑healing.

Future goals: minute‑level migration, minute‑level self‑check/self‑heal, and fully automated dynamic scaling.

Future outlook

Support minute‑level migration tasks.

Implement minute‑level self‑check and self‑heal mechanisms.

Achieve fully automated dynamic scaling based on real‑time cluster analysis.

Key source files involved in the Kafka thread model include:

core/src/main/scala/kafka/network/SocketServer.scala
clients/src/main/java/org/apache/kafka/common/network/Selector.java
clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java
core/src/main/scala/kafka/server/KafkaRequestHandler.scala
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringOperationsKafkamulti-tenantCluster Managementtiered storageauto-balancing
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.