Kafka Cluster Fault Analysis: Root Cause and Cascading Failure Mechanism

A Kafka cluster at vivo lost all traffic in one resource group after a single broker’s disk failed. The default producer partitioner kept hashing keyed messages to partitions on the failed broker, which exhausted the client buffer and blocked sends to healthy partitions. The recommended fixes are to avoid message keys where possible or to use a custom partitioner.

vivo Internet Technology

This article details the analysis and resolution of a Kafka cluster fault at vivo, in which a single broker disk failure caused traffic on multiple topics to drop to zero.

Deployment Architecture: vivo's Kafka deployment handles trillions of messages daily and is split into multiple clusters by business dimension. Each cluster is divided into logical "resource groups". Nodes within a group share resources, while groups are isolated from each other to prevent cascading failures.

Fault Symptoms: When a disk failed on one Kafka broker node, traffic on nearly all topics in that resource group dropped to zero. This was unexpected: a topic's partitions are distributed across multiple brokers, so the failure of one broker should affect only the partitions hosted on it.

Root Cause Analysis: The investigation revealed that the problem was not the disk failure itself but a cascading effect in the Kafka producer client. For messages with a key, the default partitioner picks the partition as the key hash modulo the total partition count. That count covers ALL partitions, including those on the failed broker, so routing is not restricted to healthy partitions.

Technical Deep Dive: The Kafka producer buffers messages on the client and sends them in batches. When a broker becomes unavailable, messages that the default partitioner routes to its partitions stay in the buffer until they time out. Because the buffer pool is shared by all partitions, these stuck messages gradually exhaust it. Healthy partitions can then no longer obtain buffer space, and traffic collapses completely.
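
The buffer exhaustion described above can be illustrated with a toy simulation. This is a minimal sketch, not the Kafka client: ToyProducer, BufferExhausted, and first_blocked_send are hypothetical names, records are one unit in size, healthy partitions are assumed to deliver instantly, and timeouts on the failed partition are not modelled. In the real Java producer the pool size is set by buffer.memory, and send() blocks for up to max.block.ms before failing instead of raising immediately.

```python
from dataclasses import dataclass, field


class BufferExhausted(Exception):
    """The shared pool has no room left for another record."""


@dataclass
class ToyProducer:
    """Toy model: one buffer pool shared by every partition."""
    pool_bytes: int
    failed_partitions: set
    in_use: int = 0
    delivered: dict = field(default_factory=dict)

    def send(self, partition: int, size: int = 1) -> None:
        # Every record must reserve space in the shared pool first.
        if self.in_use + size > self.pool_bytes:
            raise BufferExhausted(f"no room for partition {partition}")
        self.in_use += size
        if partition in self.failed_partitions:
            # Broker unreachable: the record stays buffered (assumption:
            # it is only released by a timeout, which is not modelled).
            return
        # Healthy partition: delivered at once, space returned to the pool.
        self.delivered[partition] = self.delivered.get(partition, 0) + 1
        self.in_use -= size


def first_blocked_send(producer: ToyProducer, partition_count: int,
                       max_sends: int = 1000):
    """Spread sends evenly over all partitions; return the index of the
    first send that cannot get buffer space, or None if none is blocked."""
    for i in range(max_sends):
        try:
            producer.send(i % partition_count)
        except BufferExhausted:
            return i
    return None


producer = ToyProducer(pool_bytes=10, failed_partitions={2})
blocked = first_blocked_send(producer, partition_count=4)
print(blocked, producer.in_use, producer.delivered)
```

In this model only one partition in four is unhealthy, yet once its stuck records fill the pool, the next send to a healthy partition is the one that gets rejected.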

Key Findings from Source Code Analysis:

If a partition is explicitly specified, messages go directly to that partition

If a key is specified, the partition is determined by hash(key) % partition_count. The count includes ALL partitions, so keyed messages can still be routed to partitions on the failed broker

If no key is specified, a round-robin approach using available partitions is used
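
The three branches above can be modelled in a few lines. This is a simplified sketch of the behaviour the article describes, not the client's source code: choose_partition is a hypothetical name, zlib.crc32 stands in for the hash the Java client actually uses (murmur2), and the fallback for the case where no partition is available is an assumption of this sketch.

```python
import itertools
import zlib
from typing import Optional, Sequence

# Shared counter for the round-robin branch.
_round_robin = itertools.count()


def choose_partition(
    key: Optional[bytes],
    all_partitions: Sequence[int],
    available_partitions: Sequence[int],
    explicit_partition: Optional[int] = None,
) -> int:
    """Simplified model of the three partition-selection branches."""
    # 1. An explicitly specified partition is used as-is.
    if explicit_partition is not None:
        return explicit_partition
    # 2. Keyed message: hash modulo the TOTAL partition count.
    #    Availability is never consulted on this branch.
    if key is not None:
        return all_partitions[zlib.crc32(key) % len(all_partitions)]
    # 3. Unkeyed message: round robin over available partitions only.
    if available_partitions:
        n = next(_round_robin)
        return available_partitions[n % len(available_partitions)]
    # Assumed fallback when nothing is available: use all partitions.
    return all_partitions[next(_round_robin) % len(all_partitions)]


all_p = [0, 1, 2, 3]
avail = [0, 1, 3]  # partition 2 sits on the failed broker
keyed = {choose_partition(str(i).encode(), all_p, avail) for i in range(200)}
unkeyed = {choose_partition(None, all_p, avail) for _ in range(20)}
print(sorted(keyed), sorted(unkeyed))
```

Only the keyed branch can still select partition 2 here, which is the asymmetry behind the fault.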

Recommendations:

Avoid specifying keys in messages unless necessary: keyed messages are hashed across all partitions, including unavailable ones, which is what allows one failed broker to stall traffic to every partition

If keys are required, implement a custom partitioner that excludes failed brokers from routing
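
One possible shape for such a partitioner is sketched below. It is an illustration under stated assumptions, not vivo's implementation: partition_for_key is a hypothetical name and zlib.crc32 stands in for the client's real key hash. In the Java client this logic would live in a class implementing the Partitioner interface and registered through the partitioner.class setting, with the available partitions taken from the cluster metadata passed to it.

```python
import zlib
from typing import Sequence


def partition_for_key(
    key: bytes,
    all_partitions: Sequence[int],
    available_partitions: Sequence[int],
) -> int:
    """Keyed partitioning that avoids partitions on failed brokers."""
    h = zlib.crc32(key)
    # Normal case: same mapping as plain hash-modulo partitioning.
    preferred = all_partitions[h % len(all_partitions)]
    if preferred in available_partitions or not available_partitions:
        return preferred
    # Preferred partition is down: re-map the key onto healthy partitions.
    return available_partitions[h % len(available_partitions)]


all_p = [0, 1, 2, 3]
avail = [0, 1, 3]  # partition 2 sits on the failed broker
targets = {partition_for_key(str(i).encode(), all_p, avail) for i in range(200)}
print(sorted(targets))
```

The trade-off in this design is that keys whose usual partition is unavailable are temporarily sent to a different partition, so per-key ordering and key-to-partition affinity are not preserved for those keys during the outage. Whether that is acceptable depends on why the keys were needed in the first place.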

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
