Why Our Redis Cluster Pipeline Deadlocked: Thread Locks Explained
This article walks through a production incident where a Redis Cluster pipeline caused Dubbo threads to block and eventually deadlock, detailing the root‑cause analysis, code inspection, and verification steps using jstack, jmap, and MAT to confirm the deadlock and propose fixes.
1. Background
Redis Pipeline is an efficient batch‑command mechanism that reduces network latency and improves read/write throughput. Redis Cluster Pipeline extends this to a Redis Cluster, packaging multiple operations and sending them to several nodes at once.
The project uses a pipeline to batch‑query game reservation information from a Redis Cluster via an internal JedisClusterPipeline utility.
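To make concrete why one pipeline call touches several nodes, here is a minimal sketch of how keys map to cluster nodes. The slot formula (CRC16 of the key, modulo 16384) comes from the Redis Cluster specification. Everything else is an assumption for illustration: the class name `ClusterSlots`, the even split of slots across masters in `nodeOf`, and the decision to ignore hash tags. It is not the code of the internal JedisClusterPipeline utility.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterSlots {
    static final int SLOT_COUNT = 16384;

    // CRC16, XMODEM variant (polynomial 0x1021, initial value 0), computed bit by bit.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Slot of a key. Assumption: hash tags ({...}) are ignored to keep the sketch short.
    static int slotOf(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    // Hypothetical layout: slots are split into equal contiguous ranges across the masters.
    static int nodeOf(int slot, int masterCount) {
        return slot * masterCount / SLOT_COUNT;
    }

    // Groups keys by the index of the node that would serve them.
    static Map<Integer, List<String>> groupByNode(List<String> keys, int masterCount) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String key : keys) {
            int node = nodeOf(slotOf(key), masterCount);
            groups.computeIfAbsent(node, n -> new ArrayList<>()).add(key);
        }
        return groups;
    }
}
```

Each distinct node in the result of `groupByNode` corresponds to one connection pool the pipeline has to borrow from, so a batch spread over several nodes holds several connections at the same time.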
2. Incident Record
An alert indicated that a Dubbo thread pool was exhausted. Only one machine showed the problem, and the number of completed tasks never increased.
Monitoring revealed that request counts dropped to zero, confirming the machine had hung. Arthas showed all 400 Dubbo threads in a waiting state.
3. Fault Analysis
3.1 Threads waiting for a connection
The thread stack traces showed they were blocked inside org.apache.commons.pool2.impl.GenericObjectPool#borrowObject(long). Because the pool’s blockWhenExhausted defaults to true and borrowMaxWaitMillis was not set (it is taken from the pool's maxWaitMillis setting, which defaults to -1), threads waited indefinitely for an idle connection.
<code>public T borrowObject(long borrowMaxWaitMillis) throws Exception {
    // ...
    while (p == null) {
        if (blockWhenExhausted) {
            p = idleObjects.pollFirst();
            if (p == null) {
                if (borrowMaxWaitMillis < 0) {
                    p = idleObjects.takeFirst(); // blocks forever
                } else {
                    p = idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
                }
            }
            // ...
        }
    }
    return p.getObject();
}
</code>
Since the business code did not set borrowMaxWaitMillis, threads kept waiting for a connection.
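The difference between the two branches can be reproduced with the JDK alone. The sketch below uses `java.util.concurrent.LinkedBlockingDeque` as a stand-in for the pool's idle-object queue (commons-pool2 ships its own deque class with the same `takeFirst` and `pollFirst` methods). The class `BorrowWaitDemo`, the `borrow` helper, and the string "connections" are hypothetical. With Jedis, the bounded wait would typically be configured through the pool configuration's max-wait setting rather than written by hand.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

class BorrowWaitDemo {
    // Mirrors the branch on borrowMaxWaitMillis:
    // a negative value waits forever, any other value waits at most that long.
    static String borrow(LinkedBlockingDeque<String> idle, long maxWaitMillis) {
        try {
            if (maxWaitMillis < 0) {
                return idle.takeFirst(); // never returns while the deque stays empty
            }
            return idle.pollFirst(maxWaitMillis, TimeUnit.MILLISECONDS); // null on timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        LinkedBlockingDeque<String> idle = new LinkedBlockingDeque<>();
        idle.addFirst("conn-1");
        System.out.println(borrow(idle, 100)); // an idle connection is available
        System.out.println(borrow(idle, 100)); // deque is empty: gives up after about 100 ms
        // borrow(idle, -1) at this point would block the calling thread indefinitely.
    }
}
```

With a bounded wait, a caller that cannot get a connection fails fast and can release whatever it already holds; with the default of -1 it parks until another thread returns a connection, which is the state the 400 Dubbo threads were found in.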
3.2 Threads unable to obtain a connection
Two possibilities were considered: inability to create a Redis connection, or all connections in the pool being occupied. Network jitter was observed, but the problematic machine could still connect to Redis, ruling out the first case.
A connection leak was also examined and ruled out: the project uses Jedis 2.9.0, which does not exhibit the known leak in version 2.10.0.
3.3 Potential deadlock
Without a borrow timeout, pipeline mode can deadlock when threads hold a connection from one pool while waiting for a connection from another (the classic “hold‑and‑wait” condition) and acquire from the pools in different orders. In the example, four threads each need connections from two Redis nodes; opposite acquisition order leads to circular waiting.
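The four-thread scenario can be sketched with two semaphores standing in for the two node pools. The pool size of 2, the class name `CircularWaitDemo`, and the latches that force the unlucky interleaving are assumptions made so the demo is deterministic. A bounded `tryAcquire` is used in place of the unbounded wait so the program terminates and reports what would otherwise hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CircularWaitDemo {
    // Returns how many of the four threads failed to get their second connection.
    static int run(long waitMillis) {
        Semaphore poolA = new Semaphore(2); // hypothetical pool size of 2 per node
        Semaphore poolB = new Semaphore(2);
        CountDownLatch allHoldFirst = new CountDownLatch(4);
        CountDownLatch allTriedSecond = new CountDownLatch(4);
        AtomicInteger timedOut = new AtomicInteger();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            // Threads 0 and 1 borrow A then B; threads 2 and 3 borrow B then A.
            Semaphore first = (i < 2) ? poolA : poolB;
            Semaphore second = (i < 2) ? poolB : poolA;
            threads[i] = new Thread(() -> {
                try {
                    first.acquire();          // hold one connection
                    allHoldFirst.countDown();
                    allHoldFirst.await();     // from here on both pools are empty
                    if (second.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                        second.release();
                    } else {
                        timedOut.incrementAndGet(); // an unbounded wait would hang here
                    }
                    allTriedSecond.countDown();
                    allTriedSecond.await();   // keep holding until every thread has tried
                    first.release();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return timedOut.get();
    }

    public static void main(String[] args) {
        System.out.println("threads that timed out: " + run(200));
    }
}
```

All four threads fail to obtain their second connection, because every connection in each pool is held by a thread that is waiting on the other pool. Replacing the timed wait with an unbounded one turns that outcome into a permanent hang.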
4. Deadlock Proof
4.1 Identify which pool each thread is waiting on
Using jstack and jmap, the lock address each thread waited for was extracted (e.g., thread 383 waiting on 0x6a3305858). MAT was then used to trace the lock back to a specific JedisPool instance.
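With 400 threads, grouping the waiters by lock address is easier to do programmatically than by eye. The following is a rough sketch that assumes the usual HotSpot jstack layout: a header line starting with the quoted thread name, followed by a `- parking to wait for <0x...>` line. The class name `JstackWaiters` is hypothetical, and the sample dump in `main` is illustrative, not taken from the incident.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JstackWaiters {
    private static final Pattern THREAD_HEADER = Pattern.compile("^\"([^\"]+)\"");
    private static final Pattern PARKED_ON =
            Pattern.compile("parking to wait for\\s+<(0x[0-9a-fA-F]+)>");

    // Maps each lock address to the names of the threads parked on it.
    static Map<String, List<String>> waitersByLock(String dump) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher header = THREAD_HEADER.matcher(line);
            if (header.find()) {
                current = header.group(1); // a new thread section starts here
                continue;
            }
            Matcher parked = PARKED_ON.matcher(line);
            if (current != null && parked.find()) {
                result.computeIfAbsent(parked.group(1), k -> new ArrayList<>()).add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative dump fragment; thread names and the second address are made up.
        String dump = String.join("\n",
                "\"worker-thread-383\" #500 daemon prio=5 waiting on condition",
                "   java.lang.Thread.State: WAITING (parking)",
                "\t- parking to wait for  <0x6a3305858> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)",
                "",
                "\"worker-thread-384\" #501 daemon prio=5 waiting on condition",
                "   java.lang.Thread.State: WAITING (parking)",
                "\t- parking to wait for  <0x6a3305900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)");
        waitersByLock(dump).forEach((lock, names) -> System.out.println(lock + " <- " + names));
    }
}
```

The number of distinct addresses in the result is the number of pools being waited on; each address can then be looked up in the heap dump to find the pool that owns it.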
4.2 Identify which pools each thread currently holds
MAT was used to search for all JedisClusterPipeline objects (one per Dubbo thread). The poolToJedisMap field revealed which connection pools each pipeline held connections from.
4.3 Analyze deadlock conditions
Out of 12 Redis master nodes, all 400 Dubbo threads were waiting on only five connection pools, each configured with a size of 20. All 5 × 20 = 100 connections in those five pools were already occupied, and every one of the 400 Dubbo threads was itself blocked, so no held connection could ever be returned. This confirmed the deadlock.
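Once the "waits on" and "holds" relations have been extracted, the check itself is mechanical. The sketch below is one way to express it, under the assumption that each thread holds at most one connection per pool. The class `DeadlockCheck`, its method, and the tiny data set in `main` are hypothetical and are not the incident data. The condition it tests is sufficient for a deadlock, not necessary.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DeadlockCheck {
    // waitingOn: blocked thread -> pool it is waiting on.
    // holds:     thread -> pools it currently holds one connection from.
    // Returns true when every waited-on pool is fully occupied by blocked threads,
    // so no connection from those pools can ever be returned.
    static boolean isDeadlocked(Map<String, String> waitingOn,
                                Map<String, List<String>> holds,
                                int poolSize) {
        Set<String> waitedPools = new HashSet<>(waitingOn.values());
        if (waitedPools.isEmpty()) {
            return false;
        }
        Map<String, Integer> heldByBlocked = new HashMap<>();
        for (Map.Entry<String, List<String>> e : holds.entrySet()) {
            if (!waitingOn.containsKey(e.getKey())) {
                continue; // a thread that is not blocked will eventually return its connections
            }
            for (String pool : e.getValue()) {
                heldByBlocked.merge(pool, 1, Integer::sum);
            }
        }
        for (String pool : waitedPools) {
            if (heldByBlocked.getOrDefault(pool, 0) < poolSize) {
                return false; // a connection is free or held by a thread that can still progress
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> waitingOn = new HashMap<>();
        Map<String, List<String>> holds = new HashMap<>();
        // Two pools of size 2; two threads hold A and wait on B, two do the opposite.
        waitingOn.put("t1", "B"); holds.put("t1", Arrays.asList("A"));
        waitingOn.put("t2", "B"); holds.put("t2", Arrays.asList("A"));
        waitingOn.put("t3", "A"); holds.put("t3", Arrays.asList("B"));
        waitingOn.put("t4", "A"); holds.put("t4", Arrays.asList("B"));
        System.out.println(isDeadlocked(waitingOn, holds, 2));
    }
}
```

Applied to the incident, the inputs would be the five waited-on pools, a pool size of 20, and the per-thread holdings read from poolToJedisMap.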
5. Summary
The article demonstrates a systematic approach to diagnosing a production failure: capturing heap and thread dumps, using Arthas for live inspection, reading source code to understand blocking behavior, forming hypotheses, and finally confirming a deadlock with MAT by correlating waiting locks and held connections. It highlights the importance of configuring connection‑pool timeouts and sizing pools appropriately to avoid similar deadlocks.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.