Fixing a Dubbo Interface with 1.8B Daily Calls: Solving Redis Connection Pool Exhaustion

When a Dubbo interface handling 1.8 billion daily requests began failing due to thread-pool exhaustion, a systematic investigation uncovered traffic spikes, Redis request overload, and misconfigured connection-pool parameters. A series of targeted fixes (timeout adjustments, Redis scaling, and pool-parameter tuning) restored stability and sharply reduced error rates.

dbaplus Community

Background

A downstream caller reported that a specific Dubbo interface was being circuit-broken at a fixed time each day, with exceptions showing that the Dubbo thread pool was exhausted. The interface handled 1.8 billion requests per day with 940 k errors, prompting an urgent optimization effort.

Emergency Response

2.1 Rapid Localization

Initial system monitoring (machine, JVM memory, GC, threads) showed only minor spikes that did not coincide with the error window, so these were ruled out. Traffic analysis revealed a sudden surge at the same time as the errors, pointing to a short-lived traffic burst as the likely cause.

(Figures omitted: traffic trend, degraded request volume, and the interface's 99th-percentile latency.)

Finding the Performance Bottleneck

3.1 Interface Flow Analysis

Flow diagram (omitted) and description:

After receiving a request, the interface calls a downstream service through Hystrix with a 500 ms circuit-breaker timeout.

Data is first fetched from a local cache; if missing, Redis is queried, and if still missing, the database is accessed asynchronously.

If the downstream call fails, a fallback path repeats the same cache‑then‑Redis‑then‑DB logic.
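The timeout-then-fallback shape of this flow can be sketched with plain JDK classes. This is a minimal illustration, not the Hystrix API and not the article's actual code: `GuardedCall`, `call`, and the thread name are hypothetical, and the fallback here is just a supplier standing in for the cache-then-Redis-then-DB path.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Plain-JDK stand-in for a timeout-guarded call with a fallback (hypothetical, not Hystrix). */
final class GuardedCall {
    // Daemon threads so this sketch never keeps the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "guarded-call");
        t.setDaemon(true);
        return t;
    });

    /** Runs primary for at most timeoutMillis; on timeout or failure returns fallback.get(). */
    <T> T call(Supplier<T> primary, Supplier<T> fallback, long timeoutMillis) {
        Callable<T> task = primary::get;
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call so its worker thread is released
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the primary call threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback.get();
        }
    }
}
```

With a 500 ms budget, a call such as `guarded.call(downstream, cachePath, 500)` returns either the downstream result or the fallback result. Note that the fallback runs on the caller's thread, so a slow fallback path still holds that thread.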

3.2 Bottleneck Investigation

1) Downstream service latency – Although the downstream P99 latency spikes above 1 s during peak traffic, the circuit-breaker timeout (500 ms) and the Dubbo timeout (100 ms) cap how long a request can wait on the downstream call, so it is unlikely to be the root cause.

2) Local cache miss leading to Redis reads – Call‑chain analysis showed Redis traffic was twice the total request volume, indicating a design flaw. Code review revealed that the local cache was never consulted; data was fetched directly from Redis, causing unnecessary load.

Further Redis monitoring confirmed traffic spikes aligned with error times.
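The effect of skipping the local cache can be illustrated with a small model that counts Redis reads. Everything here is hypothetical: `ReadPath`, its fields, and the `consultLocalCache` flag are illustrative names, the maps stand in for the real local cache and Redis, the asynchronous database tier is left out for brevity, and populating the local cache on a Redis hit is an assumption rather than something the article states.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical read path; the maps stand in for the local cache and Redis. */
final class ReadPath {
    final Map<String, String> localCache = new ConcurrentHashMap<>();
    final Map<String, String> redis = new ConcurrentHashMap<>();
    final AtomicLong redisReads = new AtomicLong();
    private final boolean consultLocalCache;

    /** Pass false to reproduce the bug (local cache never consulted), true for the fix. */
    ReadPath(boolean consultLocalCache) {
        this.consultLocalCache = consultLocalCache;
    }

    /** Returns the value for key, or null when neither tier has it. */
    String get(String key) {
        if (consultLocalCache) {
            String local = localCache.get(key);
            if (local != null) {
                return local; // served locally, no Redis round trip
            }
        }
        redisReads.incrementAndGet();
        String remote = redis.get(key);
        if (remote != null) {
            localCache.put(key, remote); // assumption: warm the local cache on a Redis hit
        }
        return remote;
    }
}
```

In this model, repeated reads of the same key cost one Redis read each when the flag is false, and a single Redis read in total when it is true.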

Solution

4.1 Deploy the Fix Identified in 3.2, Item 2)

After the cache-miss bug was fixed, Redis request volume halved and the 99th-percentile latency improved, though the problem was not completely resolved.

4.2 Redis Scaling

Further analysis considered three possible causes of Redis slowness: slow queries, a Redis service bottleneck, and client configuration. The slow-query check found nothing, but profiling showed many slow setex commands, so the cluster was expanded from 6 to 8 masters.

Scaling reduced Redis traffic spikes but did not fully eliminate the issue, suggesting the client remained a bottleneck.

4.3 Client Parameter Optimization

Two hypotheses were examined:

1) A bug in client connection management.

2) Improper connection-pool settings.

Source inspection of the Jedis client (which uses Commons‑Pool2) showed no connection‑management bug. The pool‑parameter analysis led to adjustments based on the Commons‑Pool2 documentation:

public String setex(final byte[] key, final int seconds, final byte[] value) { ... }

Key parameters tuned included maxWaitMillis, minIdle, timeBetweenEvictionRunsMillis, and minEvictableIdleTimeMillis. After setting maxWaitMillis to 200 ms and adjusting eviction settings, requests exceeding 1 s dropped dramatically, and daily degraded requests fell from ~900 k to ~60 k.
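Why a bounded wait matters can be shown with a toy pool. This is a sketch under stated assumptions, not the Commons-Pool2 implementation: `BoundedPool`, `borrow`, and `giveBack` are hypothetical names. In Commons-Pool2 the default maxWaitMillis is -1, which blocks a borrower indefinitely when the pool is exhausted, and a timed-out borrow is reported with a NoSuchElementException. The sketch mirrors that behavior with a blocking queue.

```java
import java.util.Collection;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy fixed-size pool illustrating a bounded borrow wait (hypothetical, not Commons-Pool2). */
final class BoundedPool<T> {
    private final BlockingQueue<T> idle;

    BoundedPool(Collection<T> objects) {
        // The queue capacity equals the number of pooled objects (at least 1).
        idle = new ArrayBlockingQueue<>(Math.max(1, objects.size()), false, objects);
    }

    /** Waits at most maxWaitMillis for an idle object, then fails fast instead of blocking. */
    T borrow(long maxWaitMillis) {
        try {
            T obj = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (obj == null) {
                throw new NoSuchElementException("Timeout waiting for idle object");
            }
            return obj;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new NoSuchElementException("Interrupted while waiting for idle object");
        }
    }

    /** Returns a borrowed object to the pool. */
    void giveBack(T obj) {
        idle.offer(obj);
    }
}
```

With a 200 ms bound, a request thread that cannot get a connection gives up after 200 ms and can take the degraded path, rather than sitting in the Dubbo thread pool until a connection frees up.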

Continuous Optimization

Further work focused on keeping all Redis write operations under 200 ms by refining Jedis connection handling. Source snippets of the pool implementation were examined to understand object borrowing, eviction, and idle‑object maintenance.

public T borrowObject(final long borrowMaxWaitMillis) throws Exception { ... }

Configuration changes added the following two settings to ensure proper pool pre-heating on cold start:

vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis

vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis
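The link between the eviction interval and pre-heating can be modeled with a small sketch. In Commons-Pool2, the minIdle target only takes effect when timeBetweenEvictionRunsMillis is greater than zero, because the periodic evictor task is what tops the pool up. The class below is a toy model of that behavior, not the library source: `WarmPool`, `startMaintenance`, and the other names are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Toy model of evictor-driven pool maintenance (hypothetical, not Commons-Pool2). */
final class WarmPool<T> {
    private final ConcurrentLinkedDeque<T> idle = new ConcurrentLinkedDeque<>();
    private final Supplier<T> factory;
    private final int minIdle;
    private ScheduledExecutorService maintainer;

    WarmPool(Supplier<T> factory, int minIdle) {
        this.factory = factory;
        this.minIdle = minIdle;
    }

    /** Creates objects until at least minIdle are idle. */
    void ensureMinIdle() {
        while (idle.size() < minIdle) {
            idle.addLast(factory.get());
        }
    }

    /** Starts periodic maintenance; a non-positive interval means no maintenance thread at all. */
    synchronized void startMaintenance(long intervalMillis) {
        if (intervalMillis <= 0 || maintainer != null) {
            return;
        }
        maintainer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "pool-maintainer");
            t.setDaemon(true); // do not keep the JVM alive for this sketch
            return t;
        });
        maintainer.scheduleWithFixedDelay(this::ensureMinIdle, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Takes an idle object, or creates one on demand (the slow path after a cold start). */
    T borrow() {
        T obj = idle.pollFirst();
        return obj != null ? obj : factory.get();
    }

    int idleCount() {
        return idle.size();
    }

    synchronized void close() {
        if (maintainer != null) {
            maintainer.shutdownNow();
        }
    }
}
```

In this model a pool started with a non-positive interval stays empty until the first request arrives, so the first burst of traffic pays the connection-creation cost. A pool started with a positive interval fills to minIdle in the background before traffic arrives.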

After these adjustments, Redis response times and interface latency returned to acceptable levels, as shown by the before/after charts.

Conclusion

When online issues arise, the priority is rapid service recovery through throttling, circuit‑breaking, and degradation strategies. Mastery of monitoring platforms (machine, service, interface, DB) accelerates root‑cause analysis. For Redis latency problems, examine server health, application code for bugs, and client connection‑pool settings. Properly configuring minEvictableIdleTimeMillis and related pool parameters is essential for stable performance under high traffic.
