
Kafka Stability Best Practices: Prevention, Monitoring, and Fault Resolution

This guide outlines Kafka stability best practices across three phases: prevention (tuning principles, producer and consumer guidelines, cluster configuration), runtime monitoring (white-box and black-box metrics, alerts), and fault resolution (message backlogs, blocked consumption, message loss). It also covers cost control and idempotent message consumption.

Tencent Cloud Developer

Nov 24, 2022


This article provides comprehensive best practices for ensuring Kafka stability throughout its lifecycle, organized into three main phases:

1. Prevention

Focuses on preventing issues through standardized usage and development. Key topics include:

Kafka tuning principles: establishing quantitative optimization targets for throughput, latency, durability, and availability

Producer best practices: parameter tuning (batch.size, linger.ms, compression.type, acks) and development practices (topic isolation, message flow control, message replay, ordering guarantees, efficiency improvement, reliability assurance)

Consumer best practices: parameter tuning, idempotence implementation, consumer isolation, avoiding message accumulation, rebalance prevention, reliability and order guarantee, transaction handling

Cluster configuration: broker and topic evaluation, partition configuration

Performance tuning: layered optimization across OS, JVM, Broker, and application layers; throughput vs latency optimization strategies

Stability testing: health checks and high availability testing
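The producer and consumer parameter tuning listed above can be sketched as follows. This is a minimal illustration, not the original article's configuration: the configuration keys (batch.size, linger.ms, compression.type, acks, enable.idempotence, max.poll.records, max.poll.interval.ms) are standard Kafka client settings, but every value, the profile names, and the helper functions `producer_config` and `safe_max_poll_records` are assumptions chosen for the example.

```python
"""Illustrative Kafka client settings; all values are assumptions, not recommendations."""

# Example producer profiles keyed by optimization target. The keys are
# standard Kafka producer configuration names; the values are hypothetical
# starting points that would need measurement against a real workload.
PRODUCER_PROFILES = {
    "throughput": {
        "batch.size": 131072,        # larger batches (bytes) amortize request overhead
        "linger.ms": 20,             # wait up to 20 ms for a batch to fill
        "compression.type": "lz4",   # spend some CPU to save network and disk
        "acks": "1",                 # leader-only acknowledgement
    },
    "latency": {
        "batch.size": 16384,
        "linger.ms": 0,              # send as soon as possible
        "compression.type": "none",
        "acks": "1",
    },
    "durability": {
        "batch.size": 16384,
        "linger.ms": 5,
        "compression.type": "lz4",
        "acks": "all",               # wait for all in-sync replicas
        "enable.idempotence": True,  # avoid duplicates caused by retries
    },
}


def producer_config(target: str) -> dict:
    """Return a copy of the example producer settings for one tuning target."""
    if target not in PRODUCER_PROFILES:
        raise ValueError(
            f"unknown target {target!r}; expected one of {sorted(PRODUCER_PROFILES)}"
        )
    return dict(PRODUCER_PROFILES[target])


def safe_max_poll_records(
    max_poll_interval_ms: int, per_record_ms: float, headroom: float = 0.5
) -> int:
    """Largest max.poll.records whose worst-case batch processing time stays
    within `headroom` of max.poll.interval.ms. Staying under that interval
    keeps the consumer in its group and avoids an unnecessary rebalance."""
    if per_record_ms <= 0 or not 0 < headroom <= 1:
        raise ValueError("per_record_ms must be > 0 and headroom in (0, 1]")
    return max(1, int(max_poll_interval_ms * headroom / per_record_ms))
```

For example, `safe_max_poll_records(300000, 200)` sizes a poll batch for a consumer that needs about 200 ms per record under the default five-minute poll interval, leaving half the interval as headroom. The dictionaries are plain data, so they can be passed to whichever client library is in use after adapting value types to that library's expectations.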

2. Runtime Monitoring

Ensures cluster stability during operation with monitoring best practices:

Cluster stability monitoring: Tencent Cloud CKafka configuration, self-built Kafka cluster configuration, resource isolation (broker-level physical isolation, RPC queue isolation), intelligent rate limiting

Kafka monitoring: white-box monitoring (capacity, traffic, latency, errors) and black-box monitoring

Alert configuration: CKafka alerts, self-built alert platform, Kafka monitoring components (Kafka Manager, Kafka Monitor, CruiseControl, JMX)
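One white-box signal that commonly backs such alerts is consumer lag, the gap between a partition's log end offset and the group's committed offset. The sketch below shows that arithmetic only. The function names, the input dictionaries, and the threshold are hypothetical, and treating a partition with no committed offset as unconsumed from offset 0 is a simplifying assumption (real behaviour depends on auto.offset.reset and the log start offset).

```python
from typing import Dict, List, Tuple


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset, floored at 0.

    Assumption: a partition without a committed offset counts as unconsumed
    from offset 0.
    """
    return {p: max(0, end - committed.get(p, 0)) for p, end in end_offsets.items()}


def lag_alerts(lag: Dict[int, int], threshold: int) -> List[Tuple[int, int]]:
    """Return (partition, lag) pairs whose lag exceeds the threshold, sorted by partition."""
    return sorted((p, n) for p, n in lag.items() if n > threshold)


if __name__ == "__main__":
    # Hypothetical offsets for a three-partition topic.
    lag = partition_lag({0: 100, 1: 250, 2: 40}, {0: 100, 1: 50})
    print(lag_alerts(lag, threshold=100))
```

In practice the offsets would come from the monitoring components named above (for example JMX metrics or a consumer-group describe call); the alert threshold is best set per topic, since an acceptable lag depends on message rate and latency targets.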

3. Fault Resolution

Emergency plans for handling issues:

Message backlog emergency plan: problem investigation, capacity expansion strategy, multi-threaded consumption strategy, topic transfer strategy

Consumption blocking due to exceptions: offset adjustment, multi-threaded consumption switch

Message loss plan: root cause analysis, message replay
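For the backlog plan, a rough capacity estimate helps decide between adding consumers and moving to a topic with more partitions. The sketch below is a back-of-the-envelope model, not the article's procedure: it assumes steady produce and consume rates, one consumer thread per partition at most (consumers beyond the partition count in a group stay idle), and hypothetical function and parameter names.

```python
import math


def drain_seconds(
    backlog: int,
    produce_rate: float,
    per_consumer_rate: float,
    consumers: int,
    partitions: int,
) -> float:
    """Estimated seconds to clear `backlog` messages.

    Rates are in messages per second. Only min(consumers, partitions)
    consumers do work, because each partition is read by at most one
    consumer in a group. Returns math.inf if consumption cannot outpace
    production.
    """
    active = min(consumers, partitions)
    net_rate = active * per_consumer_rate - produce_rate
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


def consumers_needed(
    backlog: int, produce_rate: float, per_consumer_rate: float, target_seconds: float
) -> int:
    """Consumers required to clear the backlog within target_seconds while
    keeping up with new traffic. If the result exceeds the partition count,
    adding consumers alone will not help."""
    required_rate = produce_rate + backlog / target_seconds
    return math.ceil(required_rate / per_consumer_rate)
```

If `consumers_needed` returns more than the topic's partition count, that points toward the other strategies listed above: multi-threaded processing inside each consumer, or transferring the backlog to a temporary topic with more partitions.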

Additional Topics

Cost control: machine selection, storage and network optimization (Zstandard compression), cluster balancing

Message consumption idempotence: using database unique constraints, setting preconditions, recording and checking operations
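The unique-constraint approach to idempotent consumption can be sketched with an in-process SQLite database. This is a minimal illustration under stated assumptions: the table names, the `message_id` field, and the balance-update business logic are hypothetical, and a production consumer would use its own database and message key.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the example tables. `processed` has a primary key on
    message_id, so inserting the same id twice violates the constraint."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balance ("
        "account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def apply_once(conn: sqlite3.Connection, message_id: str, account: str, delta: int) -> bool:
    """Apply a balance change at most once per message_id.

    The marker row and the business update share one transaction, so a
    redelivered message fails on the unique constraint and changes nothing.
    Returns True if the message was applied, False if it was a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
            conn.execute(
                "INSERT OR IGNORE INTO balance (account, amount) VALUES (?, 0)",
                (account,),
            )
            conn.execute(
                "UPDATE balance SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    store = open_store()
    print(apply_once(store, "m-1", "alice", 10))  # first delivery
    print(apply_once(store, "m-1", "alice", 10))  # redelivery of the same message
```

Keeping the deduplication marker in the same transaction as the business write is the key design choice: if the two were committed separately, a crash between them could either lose the update or allow it to be applied twice.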


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
