
Kafka Stability Challenges and Governance Framework at Soul

This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.

Soul Technical Team

Introduction

Apache Kafka is a distributed streaming platform designed for high‑throughput data streams, acting as a bridge between upstream and downstream systems.

Kafka Overview and Functions

Key functions include data aggregation (collection, integration, buffering), data distribution (multiple consumers, stream processing, storage), and decoupling producers and consumers, providing fault tolerance and scalability.

Application Scenarios

Data pipelines in ETL processes.

Event‑driven architectures for microservices.

Real‑time analytics such as recommendation systems and fraud detection.

Soul Kafka Background

Versions: 1.0.1, 2.1.1, 2.5.0, etc.

Deployed on cloud ECS with high‑performance SSDs.

Default replication factor: 3.

Retention: 3 days, SLA: 99.9%.

Total storage: 4.6 PB.
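As a hedged illustration, the 3-day retention and 3-replica defaults above correspond to broker settings along these lines (the keys are standard Kafka broker configs; `min.insync.replicas` is an assumption, a common pairing with replication factor 3, not stated in the article):

```python
# Broker-side settings matching the defaults described above.
# All keys are standard Kafka broker configs; values are illustrative.
THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000  # 259_200_000 ms

broker_defaults = {
    "log.retention.ms": THREE_DAYS_MS,   # 3-day retention
    "default.replication.factor": 3,     # replication factor 3
    "min.insync.replicas": 2,            # assumption: typical pairing with RF=3
}
```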

Self‑Built Kafka Background

Cost control by avoiding cloud usage fees.

Full customization and high controllability.

Avoidance of vendor lock‑in.

Improved stability and reliability through direct management.

Performance tuning for specific hardware and network conditions.

Stability Challenges

Rapid business growth increased load and data‑flow complexity, exposing issues such as poor cluster availability, long fault recovery times, frequent failures, and large business impact.

Impact Scope and Frequent Cases

Kafka cluster disk failures causing ISR switches and data backlog.

ZooKeeper leader election split‑brain triggered by broker bugs.

Controller switch taking over 20 minutes.

Network jitter between different data‑center brokers causing ISR instability.

Business logic overload (10× expected traffic) hitting performance bottlenecks.

Lack of exception handling leading to application crashes on Kafka hiccups.

Key Problem Summary

P0 – Ensure Cluster Availability: Prioritize critical business, streamline data call chains, replace key clusters with high‑performance SSDs, decommission low‑version clusters, and define rapid fault‑mitigation procedures.

P1 – Reduce Fault Frequency: Lower disk‑failure impact, control partition size, limit topic count, enforce data expiration policies, and manage large‑partition volumes.

Stability Governance Framework

Systematic approach covering deployment, configuration, monitoring, and standard controls to improve Kafka reliability.

Cluster Deployment Issues

Multiple businesses sharing a single cluster without physical isolation leads to performance interference, poor isolation, security risks, maintenance complexity, and uneven resource management. Best practice: deploy independent clusters per business scenario.

Multiple Kafka clusters sharing a single ZooKeeper cluster caused overload and split‑brain; solution: allocate a dedicated ZooKeeper per Kafka cluster.

Cross‑AZ (availability zone) deployment introduced network latency and instability; recommendation: keep all brokers in the same AZ and machine type.

Low‑version bugs (e.g., 1.0.1) caused frequent issues; upgraded to stable 2.5.0.

Manual deployment was slow and error‑prone; automation reduced provisioning time to minutes.

Inconsistent resource allocation; introduced a formula based on I/O and storage to standardize resource requests.
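The article does not give the exact formula, so the sketch below is purely illustrative: a sizing rule based on I/O and storage might take the larger of the broker count demanded by replicated write throughput and the count demanded by replicated storage. All thresholds (`per_broker_io_mb_s`, `per_broker_storage_tb`, the 1.5× headroom) are assumptions, not Soul's actual numbers:

```python
import math

def brokers_needed(peak_write_mb_s, storage_tb,
                   per_broker_io_mb_s=100.0, per_broker_storage_tb=4.0,
                   replication=3, headroom=1.5):
    """Hypothetical sizing rule: a cluster must cover both replicated write
    I/O (with headroom for spikes) and replicated storage; take whichever
    demands more brokers, with a floor of 3 for availability."""
    io_brokers = peak_write_mb_s * replication * headroom / per_broker_io_mb_s
    storage_brokers = storage_tb * replication / per_broker_storage_tb
    return max(3, math.ceil(max(io_brokers, storage_brokers)))

# e.g. 100 MB/s peak writes and 10 TB of data:
print(brokers_needed(100, 10))  # → 8 (storage-bound: 10 * 3 / 4 = 7.5)
```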

Cluster Configuration Issues

Overemphasis on reliability – balance ACK levels and unclean leader election based on business criticality (acks=all for core services, leader‑only acks=1 for less critical ones).

ZooKeeper dirty data – running the delete‑topic command without enabling delete.topic.enable on the brokers left stale metadata in ZooKeeper, leading to leader‑election failures and prolonged controller switches.
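The acknowledgment trade-off above can be sketched as two configuration tiers (the keys are standard Kafka client and broker configs; the tier split and exact values are illustrative, not taken from the article):

```python
# Core services: maximum durability.
core_producer = {
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # no duplicates introduced by retries
    "retries": 5,
}

# Less critical services: lower latency, weaker guarantees.
noncore_producer = {
    "acks": "1",                 # leader-only acknowledgment
}

# Broker side for core clusters: never let an out-of-sync replica
# become leader, even if that delays recovery.
core_broker = {
    "unclean.leader.election.enable": False,
}
```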

Monitoring and Alerting

Comprehensive dashboards cover ZooKeeper latency, broker health, JVM metrics, producer/consumer latency distributions, request volumes, broker load balance, topic read/write rates, and consumer lag metrics.
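One of those lag metrics is simple to compute: per-partition consumer lag is the partition's log-end offset minus the group's last committed offset. A minimal sketch (topic and partition values are made up):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus last committed offset.
    Both arguments map (topic, partition) -> offset."""
    return {tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
            for tp in log_end_offsets}

lag = consumer_lag(
    {("events", 0): 1500, ("events", 1): 900},
    {("events", 0): 1400, ("events", 1): 900},
)
print(sum(lag.values()))  # → 100 (all of it on partition 0)
```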

Standard Control Items

The Kafka SLA defines limits such as: no cross‑environment calls; no version upgrades after creation; no automatic topic creation; topic ownership by the producer; total storage ≤ 20 TB; per‑partition storage ≤ 20 GB; max partitions = 512; usage spikes ≤ 3× with one‑day notice; and naming restrictions for topics and consumer groups.
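The storage and partition limits lend themselves to mechanical checks at topic-creation time. A hedged sketch (the function and its shape are illustrative; only the thresholds come from the SLA above):

```python
TB = 1024 ** 4
GB = 1024 ** 3

def check_topic_request(total_bytes, partitions):
    """Return the list of SLA violations for a requested topic (sketch)."""
    violations = []
    if total_bytes > 20 * TB:
        violations.append("total storage exceeds 20 TB")
    if partitions > 512:
        violations.append("partition count exceeds 512")
    if total_bytes / partitions > 20 * GB:
        violations.append("per-partition storage exceeds 20 GB")
    return violations

# An oversized request trips all three checks; an in-limit one returns [].
print(check_topic_request(25 * TB, 600))
```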

Common Misuse by Business Teams

Missing retry mechanisms when calling Kafka.

Absence of exception handling in producers/consumers.

Incorrect consumer mode (assign vs. subscribe) causing data loss.

Excessive number of consumers per topic leading to resource contention and increased latency.

Pulsed large‑volume consumption causing I/O spikes and instability.
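The first two misuses above — missing retries and missing exception handling — can be addressed with a thin wrapper around the client's send call. A sketch, where `producer_send` stands in for a real client method and the linear backoff policy is an assumption:

```python
import time

def send_with_retry(producer_send, topic, value, retries=3, backoff_s=0.5):
    """Bounded retries with linear backoff, instead of letting a transient
    Kafka hiccup crash the application."""
    for attempt in range(1, retries + 1):
        try:
            return producer_send(topic, value)
        except Exception:  # narrow to the client's retriable errors in practice
            if attempt == retries:
                raise      # retries exhausted: surface the error (or dead-letter it)
            time.sleep(backoff_s * attempt)
```

Bounding the retries matters: an unbounded loop just converts a crash into a hang, while a bounded one gives the caller a clear failure to handle.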

Native Kafka Pain Points and Mitigation

Low scaling efficiency – increase redundancy, lower alert thresholds, shrink data before scaling, and enforce business‑controlled data sizes.

Replica sync limitations – avoid frequent broker restarts; perform bulk disk replacements when needed.

Consumer‑group rebalancing delays – limit partition count, tune session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms.
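For the rebalancing item, the three timeouts named above are standard Kafka consumer configs. A starting point might look like the sketch below; the values are assumptions, not recommendations from the article:

```python
# Consumer settings that govern rebalance behavior.
rebalance_tuning = {
    "session.timeout.ms": 30_000,     # broker declares the consumer dead after this
    "heartbeat.interval.ms": 10_000,  # keep at roughly 1/3 of the session timeout
    "max.poll.interval.ms": 300_000,  # max gap between poll() calls before eviction
}
```

Keeping `heartbeat.interval.ms` well below `session.timeout.ms` avoids spurious evictions (and the rebalances they trigger) during brief pauses.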

Future Outlook

After systematic governance, Kafka is stable with no incidents in the recent quarter. Future work focuses on unified SDKs, cost reduction, faster scaling via decoupled compute and storage, and exploring Pulsar as a cloud‑native, high‑throughput, low‑latency alternative.

Tags: monitoring, operations, streaming, Kafka, stability