How to Resolve Kafka ISR Fluctuations and High RT by Tuning Network Threads
This article walks through a real‑world Kafka incident in which a sudden surge in client connections caused ISR churn, frequent connection resets, and high response times, and explains how monitoring thread idle rates and increasing the network and I/O thread counts restored stability.
Event Background
One evening, after a basketball game, the on‑call operations team received an alert that a Kafka cluster’s RT (response time) had spiked and messages were backing up. Users also reported the same issue.
Problem Analysis
Inspecting the ZMS console showed an unusually high RT. Node logs revealed frequent ISR shrink‑and‑expand events on a single broker, irregular major GC pauses, and numerous connection‑reset messages. Traffic monitoring indicated that the affected node’s outbound traffic had dropped dramatically, suggesting that follower replicas were not pulling data and were being kicked out of the ISR list.
Business logs confirmed that on January 12 a large batch of new clients had come online, adding roughly 4,000 TCP connections per broker and accounting for the observed surge in connection count and RT.
Investigation and Solution
To understand the root cause, the Kafka network thread model was reviewed. Kafka uses a 1 Acceptor + N Processor + M Handler architecture: a single Acceptor thread accepts new connections and hands them to the Processor threads (sized by num.network.threads), which read requests off the wire and write responses back; requests are placed on a bounded request queue (capped by queued.max.requests), from which the Handler threads (sized by num.io.threads) pick them up for actual processing. The idle rate of each pool is exposed through the following JMX metrics (a minimal read sketch follows the list):
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent – Processor (network) thread average idle percent.
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent – Handler (I/O) thread average idle percent.
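Both metrics range from 0 (fully busy) to 1 (fully idle). As a minimal sketch of how they can be read programmatically, the snippet below connects over JMX; the host, the port (assuming the broker was started with JMX_PORT=9999), and the choice of attribute are illustrative assumptions rather than details from the incident:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IdleRateProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX remoting (e.g. started with JMX_PORT=9999);
        // host and port here are placeholders, not values from the incident.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Gauge: 1.0 = network (Processor) threads fully idle, 0.0 = fully saturated
            ObjectName net = new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent");
            double netIdle = ((Number) conn.getAttribute(net, "Value")).doubleValue();

            // Handler (I/O) thread idle rate; exposed as a meter, so read a rate attribute
            ObjectName io = new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent");
            double ioIdle = ((Number) conn.getAttribute(io, "OneMinuteRate")).doubleValue();

            System.out.printf("network idle=%.2f, handler idle=%.2f%n", netIdle, ioIdle);
        }
    }
}
```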
Monitoring showed the problematic broker’s Processor idle percent was near 0, meaning the network threads were fully saturated. The broker was configured with num.network.threads=6 and num.io.threads=16 on a 48‑core machine, yet overall CPU load was modest: the hardware had plenty of headroom, and the undersized thread pools, not the machine, were the bottleneck.
The configuration was adjusted on all brokers:
num.network.threads=12 (up from 6)
num.io.threads=32 (up from 16)
queued.max.requests=1000 (double the default of 500)
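After the rollout, one way to confirm the values actually in effect on each broker is Kafka’s AdminClient; the sketch below is a hypothetical check, with the bootstrap address and broker id as placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerThreadConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "1" is a placeholder; repeat the check for every broker
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            for (String key : new String[] {
                    "num.network.threads", "num.io.threads", "queued.max.requests"}) {
                System.out.println(key + " = " + config.get(key).value());
            }
        }
    }
}
```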
After restarting the brokers, the Processor idle percent rose to around 0.9 (90% idle) even under high TPS, and CPU utilization increased only modestly, indicating the brokers now had ample thread capacity for the load.
Conclusion
The incident was traced to a sudden influx of client connections that overwhelmed the broker’s network threads, causing request blocking, timeouts, ISR churn, and high RT. By scaling the network and I/O thread pools and enlarging the request queue, the cluster returned to stable operation. Future work includes adding global alerts for network thread idle rates to catch similar issues early.
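As a starting point for such alerting, a lightweight poller along the lines of the sketch below could sample the Processor idle percent and raise an alarm when it drops too low; the 30% threshold, 60‑second interval, and JMX endpoint are assumptions for illustration, not values from the incident:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NetworkIdleAlert {
    // Threshold and interval are illustrative choices, not from the incident writeup
    private static final double MIN_IDLE = 0.30;

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(NetworkIdleAlert::check, 0, 60, TimeUnit.SECONDS);
    }

    private static void check() {
        try {
            // Placeholder JMX endpoint; in practice, loop over every broker in the cluster
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                ObjectName name = new ObjectName(
                        "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent");
                double idle = ((Number) conn.getAttribute(name, "Value")).doubleValue();
                if (idle < MIN_IDLE) {
                    // Hook a real pager/IM notification in here
                    System.err.printf("ALERT: network thread idle %.2f < %.2f%n", idle, MIN_IDLE);
                }
            }
        } catch (Exception e) {
            e.printStackTrace(); // swallow transient JMX failures so the scheduler keeps running
        }
    }
}
```

Polling each broker individually matters here: as this incident showed, a single saturated node can drag down RT for the whole cluster before followers start falling out of the ISR.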