Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage
This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.
Background
Kafka is the core data middleware in the company, supporting both big‑data and online scenarios. Over 1000 machines form more than 20 clusters, handling petabytes of data daily and serving as the backbone for data transmission.
Challenges
Client read/write patterns are highly diverse and unpredictable, causing I/O pressure and resource contention.
Multiple business workloads share clusters, leading to interference and a larger "explosion radius".
Open‑source rate‑limiting is coarse‑grained and cannot react to real‑time disk health.
Kafka broker add/remove processes are cumbersome and inefficient.
Partition allocation ignores disk load and topic traffic, resulting in load imbalance across machines and disks.
Lack of automatic balancing and migration‑rate control hampers real‑time read/write performance.
Scaling across multiple IDC locations introduces coordination challenges.
Only a single thread‑pool for brokers causes slow requests to block other traffic.
Solution – Guardian
We built “Guardian”, a self‑developed Kafka federation cluster controller. It uses Raft for high availability and collects metrics from Kafka servers to drive governance plans. Core functions include federation metadata management, remote‑storage metadata, UUID (topicId, segmentId) allocation, cluster scheduling, multi‑tenant label isolation, fault alerting and self‑healing.
Metrics collection via JMX is inefficient because each request fetches a single MBean; we therefore use a GRPC‑based Kafka Reporter that pulls all metrics in a single RPC.
1. Partition‑level throttling
Goal: keep disk I/O within safe limits while maximizing throughput. We estimate whether a read will hit disk by comparing available memory to read/write rates, then compute a time T that data can stay in PageCache. If a partition’s LEO - MessageInRate * T exceeds the requested offset, the read is served from cache; otherwise we throttle the disk.
Define six disk‑I/O behaviors: client read/write, replica sync read/write, migration read/write.
Identify abnormal behaviors (excessive writes, any reads) and rank them in an “abnormal behavior queue” by current traffic volume.
Guardian monitors these metrics, performs root‑cause analysis per partition, and applies throttling automatically.
Enabling partition‑level throttling triggers automatic disk‑migration tasks, which have significantly improved cluster stability.
2. Automatic partition balancing
We generate migration plans based on disk load, topic distribution, and per‑partition traffic. Plans select target disks with the lowest historical load median, schedule incremental migration to avoid long‑tail blocking, and dynamically adjust speed according to cluster load.
Support concurrent migrations across clusters.
Pre‑allocate partitions for new topics based on current disk load and expected traffic.
Leader balancing to avoid hotspot leaders.
Cancel migrations when nodes fail or topics need expansion.
3. Multi‑tenant resource isolation
Introduce exclusive resources per application domain. Topics can be created with a designated exclusive resource; the system allocates machines from a pool accordingly, supports dynamic scaling, and can shrink resources by migrating partitions away.
4. Multi‑IDC management
Support cross‑IDC topic migration with one‑click, configurable replica placement from cluster‑level down to topic‑level, and near‑by read routing so that requests are served by the closest replica.
5. Request‑queue splitting
Separate fast and slow request handling by assigning them to dedicated thread pools, reducing latency impact of slow partitions. Slow requests are routed to a special SlowRequest pool after being marked by ChannelMarker.
6. Tiered storage
Break the tight binding between partitions and local disks. Use HDFS as remote storage while keeping recent data on SSD. A Raft‑based meta service backed by RocksDB stores metadata (key: cluster_topic_uuid_id_partition_event_id, value: PB‑serialized bytes). The system fetches data from local cache when possible and falls back to remote storage for older segments, reducing HDFS load.
Batch meta updates with leader fencing for strong consistency.
Local cache of remote segments (one segment per read) to lower latency.
Reduced write RT jitter and lower write latency because full data no longer resides on SSD.
7. Audit and cost management
We added an audit feature that streams production and consumption details to ClickHouse in real time, enabling fine‑grained troubleshooting and a cost‑management system that automatically cleans up redundant topics, reducing waste and improving security.
Operational improvements
Automated smooth rolling upgrades: batch machine up/down without manual steps, cutting manpower from 15 people/day to 1 person/hour.
Automatic fault detection and self‑healing.
Future goals: minute‑level migration, minute‑level self‑check/self‑heal, and fully automated dynamic scaling.
Future outlook
Support minute‑level migration tasks.
Implement minute‑level self‑check and self‑heal mechanisms.
Achieve fully automated dynamic scaling based on real‑time cluster analysis.
Key source files involved in the Kafka thread model include:
core/src/main/scala/kafka/network/SocketServer.scala clients/src/main/java/org/apache/kafka/common/network/Selector.java clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java core/src/main/scala/kafka/server/KafkaRequestHandler.scalaSigned-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
