Fixing 1.8B Daily Dubbo Calls: Solving Redis Connection Pool Exhaustion
When a Dubbo interface handling 1.8 billion daily requests began failing due to thread-pool exhaustion, a systematic investigation uncovered traffic spikes, Redis request overload, and misconfigured connection-pool parameters. A series of targeted fixes (timeout adjustments, Redis scaling, and pool-parameter tuning) restored stability and cut daily degraded requests from roughly 900 k to roughly 60 k.
Background
A downstream caller reported that a specific Dubbo interface was tripping its circuit breaker at a fixed time each day, throwing exceptions because the Dubbo thread pool was exhausted. The interface handled 1.8 billion requests per day with 940 k errors, prompting an urgent optimization effort.
Emergency Response
2.1 Rapid Localization
Initial system monitoring (machine, JVM memory, GC, threads) showed only minor spikes that did not coincide with the error time, so these were ruled out. Traffic analysis revealed a sudden surge at the same time as the errors, pointing to a short-lived high-traffic burst as the likely cause.
Figures show traffic trends, degraded request volume, and the 99th‑percentile latency line of the interface.
Finding the Performance Bottleneck
3.1 Interface Flow Analysis
Flow diagram (omitted) and description:
After receiving a request, a downstream service is called via Hystrix with a 500 ms circuit‑breaker timeout.
Data is first fetched from a local cache; if missing, Redis is queried, and if still missing, the database is accessed asynchronously.
If the downstream call fails, a fallback path repeats the same cache‑then‑Redis‑then‑DB logic.
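The call-with-timeout-then-fallback step can be sketched with the JDK alone. This is not Hystrix code: `TimedCall` and `callWithFallback` are hypothetical names, and `CompletableFuture.orTimeout` (Java 9+) stands in for the circuit-breaker timeout described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for a Hystrix-style command: run the primary call
// under a time budget and switch to the fallback on failure or timeout.
final class TimedCall {
    private TimedCall() {
    }

    static <T> T callWithFallback(Supplier<T> primary, Supplier<T> fallback, long timeoutMillis) {
        return CompletableFuture.supplyAsync(primary)
                // Complete exceptionally if the primary call exceeds the budget.
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                // Any failure, including the timeout, is routed to the fallback.
                .exceptionally(error -> fallback.get())
                .join();
    }
}
```

Note that the timed-out primary task is not interrupted in this sketch; it keeps occupying a worker thread until it returns, which is one way slow dependencies can still tie up threads behind a timeout.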
3.2 Bottleneck Investigation
1) Downstream service latency – Although the downstream P99 latency spikes above 1 s during peak traffic, the circuit-breaker timeout (500 ms) and Dubbo timeout (100 ms) bound how long a request can wait on it, making it unlikely to be the root cause.
2) Local cache miss leading to Redis reads – Call‑chain analysis showed Redis traffic was twice the total request volume, indicating a design flaw. Code review revealed that the local cache was never consulted; data was fetched directly from Redis, causing unnecessary load.
Further Redis monitoring confirmed traffic spikes aligned with error times.
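A minimal sketch of the intended read path, in which the local cache is consulted before Redis, might look like the following. `TieredReader`, its `remoteCache` function (a stand-in for a Redis GET), and the `remoteReads` counter are all hypothetical names for illustration; the article does not show the real code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical two-tier reader: local cache first, remote cache second.
final class TieredReader {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final Function<String, String> remoteCache; // stand-in for a Redis GET
    final AtomicLong remoteReads = new AtomicLong();     // counts calls that reach Redis

    TieredReader(Function<String, String> remoteCache) {
        this.remoteCache = remoteCache;
    }

    String get(String key) {
        String local = localCache.get(key);
        if (local != null) {
            return local; // served locally, no Redis traffic
        }
        remoteReads.incrementAndGet();
        String remote = remoteCache.apply(key);
        if (remote != null) {
            localCache.put(key, remote); // populate so the next read stays local
        }
        return remote; // null means both tiers missed; the caller would load from the DB
    }
}
```

Comparing a counter like `remoteReads` against total request volume is one way to spot a bypassed local cache. The sketch omits expiry and size limits, which a production local cache would need.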
Solution
4.1 Deploying the Fix Identified in 3.2 (Item 2)
After the cache-miss bug was fixed, Redis request volume halved and the 99th-percentile latency improved, though the problem was not completely resolved.
4.2 Redis Scaling
Further analysis considered three possible causes of Redis slowness: slow queries, Redis service bottlenecks, and client configuration. No slow queries were found. Profiling nevertheless showed many slow setex commands, so the cluster was expanded from 6 to 8 masters.
Scaling reduced Redis traffic spikes but did not fully eliminate the issue, suggesting the client remained a bottleneck.
4.3 Client Parameter Optimization
Two hypotheses were examined:
Bug in client connection management.
Improper connection‑pool settings.
Source inspection of the Jedis client (which uses Commons-Pool2) showed no connection-management bug. The pool-parameter analysis led to adjustments based on the Commons-Pool2 documentation:
public String setex(final byte[] key, final int seconds, final byte[] value) { ... }
Key parameters tuned included maxWaitMillis, minIdle, timeBetweenEvictionRunsMillis, and minEvictableIdleTimeMillis. After setting maxWaitMillis to 200 ms and adjusting eviction settings, requests exceeding 1 s dropped dramatically, and daily degraded requests fell from ~900 k to ~60 k.
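The effect of a bounded borrow wait such as maxWaitMillis can be illustrated with a simplified, JDK-only pool. This is not the Commons-Pool2 implementation: `BoundedPool`, `borrow`, and `giveBack` are hypothetical names, and the sketch only assumes that a timed-out borrow fails fast instead of blocking the caller indefinitely.

```java
import java.util.Collection;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical fixed-size pool whose borrow gives up after maxWaitMillis.
final class BoundedPool<T> {
    private final BlockingQueue<T> idle;
    private final long maxWaitMillis;

    BoundedPool(Collection<T> objects, long maxWaitMillis) {
        this.idle = new LinkedBlockingQueue<>(objects);
        this.maxWaitMillis = maxWaitMillis;
    }

    T borrow() {
        try {
            // Wait at most maxWaitMillis for another caller to return an object.
            T object = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (object == null) {
                throw new NoSuchElementException("no idle object within " + maxWaitMillis + " ms");
            }
            return object;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an idle object", e);
        }
    }

    void giveBack(T object) {
        idle.offer(object);
    }
}
```

With a short wait, a caller that cannot get a connection fails quickly and can degrade, rather than holding a request thread while it queues for the pool.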
Continuous Optimization
Further work focused on keeping all Redis write operations under 200 ms by refining Jedis connection handling. Source snippets of the pool implementation were examined to understand object borrowing, eviction, and idle‑object maintenance.
public T borrowObject(final long borrowMaxWaitMillis) throws Exception { ... }
Configuration changes added vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis to ensure proper pool pre-heating on cold start.
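Why eviction settings affect pre-heating can be seen in a simplified model of the pool's maintenance pass. In Commons-Pool2 the idle-object evictor is also what tops the pool back up to minIdle, and it only runs when timeBetweenEvictionRunsMillis is positive. The sketch below is a JDK-only approximation, not the library's code: `WarmPool` and `maintain` are hypothetical names, and the clock is passed in explicitly to keep the behavior deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical model of one evictor pass: drop stale idle objects, then refill to minIdle.
final class WarmPool<T> {
    private static final class Idle<E> {
        final E object;
        final long idleSinceMillis;

        Idle(E object, long idleSinceMillis) {
            this.object = object;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    private final Deque<Idle<T>> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final int minIdle;
    private final long minEvictableIdleMillis;

    WarmPool(Supplier<T> factory, int minIdle, long minEvictableIdleMillis) {
        this.factory = factory;
        this.minIdle = minIdle;
        this.minEvictableIdleMillis = minEvictableIdleMillis;
    }

    // One maintenance pass; a real pool would run this on a timer.
    synchronized void maintain(long nowMillis) {
        // Evict objects that have sat idle for at least the threshold.
        idle.removeIf(entry -> nowMillis - entry.idleSinceMillis >= minEvictableIdleMillis);
        // Create objects until minIdle are ready to be borrowed.
        while (idle.size() < minIdle) {
            idle.addLast(new Idle<>(factory.get(), nowMillis));
        }
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

In this model the pool stays empty until the first `maintain` pass, so the first requests after a cold start would each pay the connection-creation cost. That is the gap a periodic maintenance pass closes.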
After these adjustments, Redis response times and interface latency returned to acceptable levels, as shown by the before/after charts.
Conclusion
When online issues arise, the priority is rapid service recovery through throttling, circuit‑breaking, and degradation strategies. Mastery of monitoring platforms (machine, service, interface, DB) accelerates root‑cause analysis. For Redis latency problems, examine server health, application code for bugs, and client connection‑pool settings. Properly configuring minEvictableIdleTimeMillis and related pool parameters is essential for stable performance under high traffic.
dbaplus Community
