Why Our Druid Pool Stalled: Uncovering Thread Deadlocks and OpenTelemetry Bugs
After the user center's test environment repeatedly stopped serving all database requests, we traced the issue to Druid connection-pool threads blocked in takeLast(), found that OpenTelemetry's global lock was causing a deadlock, and resolved it by upgrading Druid and setting maxWait.
Background
The user-center test environment had a recurring problem: some time after the application started, all database-related requests stopped responding, and only a restart temporarily fixed it. The connection pool was Druid 1.1.12 on MySQL 5.7, and Arthas was used for diagnosis.
Initial Investigation
We first looked at thread states and saw most threads in the WAITING state. Using Arthas's thread -all command we captured a full thread dump.
Inspecting an individual thread (thread 39) showed it blocked inside com.alibaba.druid.pool.DruidDataSource.takeLast(), and many other request threads were stuck in the same place. On the MySQL side, show full processlist showed no active connections doing any work.
Further Analysis
Searching online revealed similar Druid bugs. Upgrading Druid resolved the issue temporarily.
However, the problem reappeared. A second pass through the pool's source showed that a check on maxWait decides how a borrower waits for a connection: takeLast() blocks indefinitely, while pollLast() gives up after maxWait. Configuring maxWait therefore stopped requests from hanging forever.
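To make that difference concrete, here is a small self-contained sketch. It is not Druid's internal code, but java.util.concurrent.LinkedBlockingDeque happens to expose the same pair of operations, and the maxWaitMillis value here is just an example:

import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class MaxWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        // An empty deque stands in for a pool with no idle connections.
        LinkedBlockingDeque<String> idleConnections = new LinkedBlockingDeque<>();
        long maxWaitMillis = 2_000; // example value; leaving maxWait unset means the else branch runs

        String conn;
        if (maxWaitMillis > 0) {
            // Bounded wait: returns null after maxWaitMillis so the caller can fail fast
            // (Druid surfaces this as GetConnectionTimeoutException).
            conn = idleConnections.pollLast(maxWaitMillis, TimeUnit.MILLISECONDS);
        } else {
            // Unbounded wait: parks until a connection shows up, potentially forever.
            conn = idleConnections.takeLast();
        }
        System.out.println(conn == null ? "timed out waiting for a connection" : conn);
    }
}

With the unbounded branch, a pool that can never be refilled turns every request into a permanently parked thread, which is exactly the symptom visible in the thread dump.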
Root Cause
Deeper digging with Arthas showed that the two blocked threads were Druid's internal "Create" (connection-creator) threads. In the test environment Druid ran with its default settings (maxActive=8, minIdle=0, and no maxWait configured), so the pool kept no idle connections. Once a task finished and traffic died down, the remaining connections were destroyed after 30 minutes of idleness, leaving the pool empty.
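Before looking at where the Create thread got stuck, it helps to picture the handshake it normally performs with borrowers. The following is a self-contained model of that protocol; it is a sketch rather than Druid's code, but the empty condition and the waiters counter deliberately mirror the empty condition and notEmptyWaitThreadCount field that appear in the Druid excerpt at the end of this article (error handling, maxActive limits, and keepAlive are omitted):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class PoolHandshakeDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition empty = lock.newCondition();     // the creator parks here until there is demand
    private final Condition notEmpty = lock.newCondition();  // borrowers park here until there is supply
    private final Deque<String> idle = new ArrayDeque<>();
    private int waiters = 0;                                  // plays the role of notEmptyWaitThreadCount

    String borrow() throws InterruptedException {
        lock.lock();
        try {
            while (idle.isEmpty()) {
                waiters++;
                empty.signal();        // wake the creator: someone is waiting for a connection
                try {
                    notEmpty.await();  // unbounded wait, like takeLast() when maxWait is unset
                } finally {
                    waiters--;
                }
            }
            return idle.pollLast();
        } finally {
            lock.unlock();
        }
    }

    void createLoop() {
        try {
            while (true) {
                lock.lock();
                try {
                    // Create only when more borrowers are waiting than connections are pooled,
                    // mirroring the poolingCount >= notEmptyWaitThreadCount check in the excerpt below.
                    while (idle.size() >= waiters) {
                        empty.await();
                    }
                } finally {
                    lock.unlock();
                }
                String conn = "conn-" + System.nanoTime(); // stand-in for the physical connect + test SQL
                lock.lock();
                try {
                    idle.addLast(conn);
                    notEmpty.signal(); // hand the new connection to a waiting borrower
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut the creator down
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PoolHandshakeDemo pool = new PoolHandshakeDemo();
        Thread creator = new Thread(pool::createLoop, "CreateConnectionThread");
        creator.setDaemon(true);
        creator.start();
        System.out.println("borrowed " + pool.borrow()); // empty pool: wakes the creator, then waits for it
    }
}

The important property for this incident is the last step: if the creator can never finish making a connection, the borrower on the other side of the handshake never wakes up.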
OpenTelemetry's instrumentation intercepted the execute method and took a global lock whenever connection metadata was missing. With no idle connections in the pool, the Create thread was woken up to create a new one, but the test SQL it ran was also intercepted and needed the same global lock, which was already held, so the threads deadlocked. From that point on the pool could no longer create connections, and every borrower waited forever in takeLast().
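Reduced to its essentials, the stall is the classic pattern of one thread holding a lock while it waits for a resource that only another thread, which needs that same lock, can provide. The sketch below is illustrative only and is not the OpenTelemetry or Druid code: globalLock stands in for the instrumentation's global lock, and a SynchronousQueue stands in for the hand-off between the Create thread and a waiting borrower.

import java.util.concurrent.SynchronousQueue;

public class StallSketch {
    // Stand-in for the instrumentation's global lock, not a real OpenTelemetry object.
    private static final Object globalLock = new Object();
    // Stand-in for the hand-off between the create thread and a waiting borrower.
    private static final SynchronousQueue<String> pool = new SynchronousQueue<>();

    public static void main(String[] args) {
        Thread borrower = new Thread(() -> {
            synchronized (globalLock) {                // holds the "global lock"...
                try {
                    String conn = pool.take();         // ...while waiting, with no timeout, for a connection
                    System.out.println("got " + conn); // never reached
                } catch (InterruptedException ignored) {
                }
            }
        }, "borrower");

        Thread creator = new Thread(() -> {
            // The creator's test SQL is intercepted, so it needs the same global lock
            // before it can hand over the connection it just created.
            synchronized (globalLock) {
                try {
                    pool.put("new-connection");
                } catch (InterruptedException ignored) {
                }
            }
        }, "creator");

        borrower.start();
        creator.start();
        // Whichever thread wins the lock first, the two end up waiting on each other forever:
        // a thread dump shows one parked on the queue and the other blocked on the monitor.
    }
}

Once this happens, the pool is effectively frozen: the Create thread never finishes, and with maxWait unset every new request joins the crowd parked in takeLast().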
Solution
Upgrading Druid to a version with the fix and configuring maxWait appropriately resolved the issue. The original write-up links the relevant GitHub issue and the fixed code (a computeIfAbsent call made under a global lock). The excerpt below shows the wait logic in Druid's connection-creation thread, and a short configuration sketch follows it.
try {
    boolean emptyWait = true;
    if (createError != null && poolingCount == 0 && !discardChanged) {
        emptyWait = false;
    }
    if (emptyWait && asyncInit && createCount.get() < initialSize) {
        emptyWait = false;
    }
    if (emptyWait) {
        // Must have waiting thread to create connection
        if (poolingCount >= notEmptyWaitThreadCount
                && !(keepAlive && activeCount + poolingCount < minIdle)) {
            // wait for empty connection signal
            empty.await();
        }
        // Prevent creating more than maxActive connections
        if (activeCount + poolingCount >= maxActive) {
            empty.await();
            continue;
        }
    }
} catch (InterruptedException e) {
    lastCreateError = e;
    lastErrorTimeMillis = System.currentTimeMillis();
    if (!closing) {
        LOG.error("create connection Thread Interrupted, url: " + jdbcUrl, e);
    }
    break;
} finally {
    lock.unlock();
}
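On the configuration side, the fix amounts to not relying on the defaults described earlier. A minimal sketch, assuming programmatic setup; the setters are real DruidDataSource methods, but the URL, credentials, and values are examples rather than the settings used in the incident:

import com.alibaba.druid.pool.DruidDataSource;

public class PoolConfigSketch {
    public static DruidDataSource newDataSource() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/app"); // example URL
        ds.setUsername("app");                        // example credentials
        ds.setPassword("secret");
        ds.setMaxWait(5_000);              // bound getConnection() so a stuck pool fails fast instead of hanging
        ds.setMinIdle(5);                  // keep a few idle connections instead of the default 0
        ds.setKeepAlive(true);             // keep the minIdle connections alive rather than evicting them all
        ds.setValidationQuery("SELECT 1"); // cheap test query for validation
        return ds;
    }
}

The key line for this incident is setMaxWait: even if a future bug freezes connection creation again, requests will fail with a visible timeout instead of silently parking forever.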