Kafka Stability Challenges and Governance Framework at Soul
This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.
Introduction
Apache Kafka is a distributed streaming platform designed for high‑throughput data streams, acting as a bridge between upstream and downstream systems.
Kafka Overview and Functions
Key functions include data aggregation (collection, integration, buffering), data distribution (multiple consumers, stream processing, storage), and decoupling producers and consumers, providing fault tolerance and scalability.
Application Scenarios
Data pipelines in ETL processes.
Event‑driven architectures for microservices.
Real‑time analytics such as recommendation systems and fraud detection.
Soul Kafka Background
Versions: 1.0.1, 2.1.1, 2.5.0, etc.
Deployed on cloud ECS with high‑performance SSDs.
Default replication factor: 3.
Retention: 3 days, SLA: 99.9%.
Total storage: 4.6 PB.
Self‑Built Kafka Background
Cost control by avoiding cloud usage fees.
Full customization and high controllability.
Avoidance of vendor lock‑in.
Improved stability and reliability through direct management.
Performance tuning for specific hardware and network conditions.
Stability Challenges
Rapid business growth increased load and data‑flow complexity, exposing issues such as poor cluster availability, long fault recovery times, frequent failures, and large business impact.
Impact Scope and Frequent Cases
Kafka cluster disk failures causing ISR switches and data backlog.
ZooKeeper leader election split‑brain triggered by broker bugs.
Controller switch taking over 20 minutes.
Network jitter between different data‑center brokers causing ISR instability.
Business logic overload (10× expected traffic) hitting performance bottlenecks.
Lack of exception handling leading to application crashes on Kafka hiccups.
Key Problem Summary
P0 – Ensure Cluster Availability: Prioritize critical business, streamline data call chains, move key clusters to high‑performance SSDs, decommission low‑version clusters, and define rapid fault‑mitigation procedures.
P1 – Reduce Fault Frequency: Lower disk‑failure impact, control partition size, limit topic count, enforce data expiration policies, and manage large‑partition volumes.
Stability Governance Framework
Systematic approach covering deployment, configuration, monitoring, and standard controls to improve Kafka reliability.
Cluster Deployment Issues
Multiple businesses sharing a single cluster without physical isolation leads to performance interference, poor isolation, security risks, maintenance complexity, and uneven resource management. Best practice: deploy independent clusters per business scenario.
Multiple Kafka clusters sharing a single ZooKeeper cluster caused overload and split‑brain; solution: allocate a dedicated ZooKeeper per Kafka cluster.
Cross‑AZ (availability zone) deployment introduced network latency and instability; recommendation: keep all brokers in the same AZ and machine type.
Low‑version bugs (e.g., 1.0.1) caused frequent issues; upgraded to stable 2.5.0.
Manual deployment was slow and error‑prone; automation reduced provisioning time to minutes.
Resource allocation was inconsistent across teams; a formula based on I/O throughput and storage demand was introduced to standardize resource requests.
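The article does not publish the actual sizing formula, so the sketch below is a hypothetical reconstruction of how an I/O‑and‑storage‑based request could work; every coefficient (disk size per broker, NIC budget, headroom) is an illustrative assumption.

```python
import math

def estimate_brokers(peak_mb_per_s: float, retention_days: int,
                     replication_factor: int = 3,
                     broker_disk_tb: float = 2.0,
                     broker_net_mb_per_s: float = 100.0,
                     headroom: float = 0.5) -> int:
    """Size a cluster from peak write throughput and retained storage.

    All broker capacity figures are assumed values, not Soul's real hardware.
    """
    # Storage demand: write rate * retention window * replication factor.
    storage_tb = (peak_mb_per_s * 86400 * retention_days
                  * replication_factor) / 1e6
    brokers_for_storage = storage_tb / (broker_disk_tb * headroom)
    # I/O demand: replication multiplies the inbound traffic each broker NIC carries.
    brokers_for_io = (peak_mb_per_s * replication_factor) / (broker_net_mb_per_s * headroom)
    # Never go below the replication factor's natural minimum of 3 brokers.
    return max(3, math.ceil(max(brokers_for_storage, brokers_for_io)))

print(estimate_brokers(peak_mb_per_s=50, retention_days=3))  # → 39
```

Taking the larger of the storage-driven and I/O-driven broker counts matches the article's point that both dimensions must be standardized, not just one.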
Cluster Configuration Issues
Overemphasis on reliability – balance ACK levels and unclean leader election based on business criticality (high ACK for core services, leader‑only for less critical).
ZooKeeper dirty data – running the delete‑topic command without delete.topic.enable set to true left stale metadata in ZooKeeper, leading to leader‑election failures and prolonged controller switches.
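The ACK guidance above can be sketched as two configuration tiers. The split into "core" and "non‑core" profiles is an assumption based on the article's advice; the keys themselves are standard Kafka producer and topic settings.

```python
# Tiered reliability profiles: strong durability for core services,
# leader-only acks for less critical traffic. The tier names are illustrative.

CORE_PRODUCER = {
    "acks": "all",               # wait for the full in-sync replica set
    "enable.idempotence": True,  # retries cannot introduce duplicates
}
CORE_TOPIC = {
    "min.insync.replicas": 2,                 # with RF=3, tolerate one replica down
    "unclean.leader.election.enable": False,  # never elect an out-of-sync leader
}

NONCORE_PRODUCER = {
    "acks": "1",  # leader-only ack: lower latency, small loss window on failover
}
NONCORE_TOPIC = {
    "unclean.leader.election.enable": True,  # prefer availability over durability
}

def pick_configs(is_core: bool):
    """Return (producer settings, topic settings) for a business tier."""
    return (CORE_PRODUCER, CORE_TOPIC) if is_core else (NONCORE_PRODUCER, NONCORE_TOPIC)
```

Pairing acks=all with min.insync.replicas=2 is what actually makes the core tier durable: acks=all alone degrades to leader-only if the ISR shrinks to one.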
Monitoring and Alerting
Comprehensive dashboards cover ZooKeeper latency, broker health, JVM metrics, producer/consumer latency distributions, request volumes, broker load balance, topic read/write rates, and consumer lag.
Standard Control Items
The Kafka SLA defines limits such as: no cross‑environment calls; no version upgrades after a cluster is created; no automatic topic creation; topics owned by their producers; total topic storage ≤ 20 TB; per‑partition storage ≤ 20 GB; at most 512 partitions per topic; traffic spikes above 3× baseline requiring one day's advance notice; and naming restrictions for topics and consumer groups.
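The quantitative limits in the SLA lend themselves to an automated admission check. A minimal sketch, assuming a hypothetical validate_topic_request hook in the provisioning flow (the function and field names are illustrative, the thresholds are the article's):

```python
# Admission check enforcing the storage and partition limits from the SLA.

MAX_TOPIC_STORAGE_GB = 20 * 1024   # total storage per topic <= 20 TB
MAX_PARTITION_STORAGE_GB = 20      # per-partition storage <= 20 GB
MAX_PARTITIONS = 512

def validate_topic_request(storage_gb: float, partitions: int) -> list[str]:
    """Return a list of SLA violations; empty means the request is admissible."""
    errors = []
    if partitions > MAX_PARTITIONS:
        errors.append(f"partition count {partitions} exceeds {MAX_PARTITIONS}")
    if storage_gb > MAX_TOPIC_STORAGE_GB:
        errors.append(f"total storage {storage_gb} GB exceeds 20 TB")
    if partitions > 0 and storage_gb / partitions > MAX_PARTITION_STORAGE_GB:
        errors.append("per-partition storage exceeds 20 GB; "
                      "add partitions or shorten retention")
    return errors

print(validate_topic_request(storage_gb=2000, partitions=128))  # → []
```

Rejecting requests at provisioning time keeps the large‑partition problem from Key Problem P1 from arising in the first place.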
Common Misuse by Business Teams
Missing retry mechanisms when calling Kafka.
Absence of exception handling in producers/consumers.
Incorrect consumer mode (assign vs. subscribe) causing data loss.
Excessive number of consumers per topic leading to resource contention and increased latency.
Pulsed large‑volume consumption causing I/O spikes and instability.
Native Kafka Pain Points and Mitigation
Low scaling efficiency – increase redundancy, lower alert thresholds, shrink data before scaling, and enforce business‑controlled data sizes.
Replica sync limitations – avoid frequent broker restarts; perform bulk disk replacements when needed.
Consumer‑group rebalancing delays – limit partition count, tune session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms.
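The three consumer timeouts named above interact, and mis‑tuning them is a common cause of spurious rebalances. A minimal lint sketch, assuming the conventional rules of thumb (heartbeat at most one third of the session timeout, poll interval longer than the session timeout); these thresholds are general Kafka practice, not figures from the article:

```python
def check_rebalance_config(cfg: dict) -> list[str]:
    """Warn about timeout combinations known to trigger needless rebalances."""
    warnings = []
    session = cfg["session.timeout.ms"]
    heartbeat = cfg["heartbeat.interval.ms"]
    max_poll = cfg["max.poll.interval.ms"]
    # A heartbeat close to the session timeout leaves no slack for jitter,
    # so a single delayed heartbeat can evict the consumer from the group.
    if heartbeat > session / 3:
        warnings.append("heartbeat.interval.ms should be <= session.timeout.ms / 3")
    # If a batch can take longer to process than max.poll.interval.ms allows,
    # the consumer is kicked out and a rebalance starts.
    if max_poll <= session:
        warnings.append("max.poll.interval.ms should exceed session.timeout.ms")
    return warnings

print(check_rebalance_config({
    "session.timeout.ms": 30000,
    "heartbeat.interval.ms": 10000,
    "max.poll.interval.ms": 300000,  # must cover worst-case batch processing time
}))  # → []
```

Combined with the partition‑count limit from the SLA, this bounds both how often rebalances happen and how long each one takes.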
Future Outlook
After systematic governance, Kafka has been stable, with no incidents in the most recent quarter. Future work focuses on unified SDKs, cost reduction, faster scaling via decoupled compute and storage, and exploring Pulsar as a cloud‑native, high‑throughput, low‑latency alternative.
Soul Technical Team
Technical practice sharing from Soul