Why Our Druid Pool Stalled: Uncovering Thread Deadlocks and OpenTelemetry Bugs

After the user center’s test environment repeatedly stopped serving all database requests, we traced the issue to Druid’s connection-pool threads blocking in takeLast(), discovered that a global lock taken by the OpenTelemetry agent caused a deadlock, and resolved it by upgrading Druid and setting maxWait.

Ziru Technology

Background

In the user-center test environment, all database-related requests stopped responding some time after the application started, and only a restart temporarily restored service. The pool was Druid 1.1.12 against MySQL 5.7; Arthas was used for diagnosis.

Initial Investigation

First we looked at thread states and saw most threads in the WAITING state. With Arthas's thread -all command we captured a full thread dump.

Inspecting thread 39 with the thread command showed many threads blocked inside com.alibaba.druid.pool.DruidDataSource.takeLast(), yet show full processlist on MySQL reported no active connections.
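Outside Arthas, the same kind of dump can be taken with the JDK's own ThreadMXBean. The sketch below is not from the original investigation; it simply reproduces what we looked for, filtering for threads in WAITING or TIMED_WAITING and printing their stacks:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpWaiting {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump all threads with full stacks, similar to Arthas's `thread -all`.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            Thread.State state = info.getThreadState();
            if (state == Thread.State.WAITING || state == Thread.State.TIMED_WAITING) {
                System.out.println(info.getThreadName() + " -> " + state);
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}
```

In the incident, the giveaway was many application threads whose top frames all sat inside DruidDataSource.takeLast().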

Further Analysis

Searching online revealed similar Druid bugs. Upgrading Druid resolved the issue temporarily.

However, the problem reappeared. A second pass through Druid's source showed a conditional on maxWait that chooses between takeLast and pollLast: with maxWait unset (non-positive), the pool calls takeLast(), which blocks indefinitely; with maxWait > 0, it calls pollLast() with a timeout and gives up after maxWait milliseconds. Explicitly setting maxWait prevented requests from hanging forever.
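The branch can be modeled with a plain BlockingDeque. This is a simplified sketch, not Druid's actual code (the real pool hands out DruidConnectionHolder objects under a lock, which is omitted here):

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class PollVsTake {
    // Simplified model of the maxWait branch: with maxWait > 0 the pool polls
    // with a timeout; otherwise it blocks indefinitely in takeLast().
    static String borrow(BlockingDeque<String> pool, long maxWaitMillis) throws InterruptedException {
        if (maxWaitMillis > 0) {
            return pool.pollLast(maxWaitMillis, TimeUnit.MILLISECONDS); // gives up -> null
        }
        return pool.takeLast(); // blocks until a connection is returned
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingDeque<String> pool = new LinkedBlockingDeque<>();
        // Empty pool with maxWait set: returns null after 100 ms instead of hanging.
        System.out.println(borrow(pool, 100));
        pool.add("conn-1");
        // Pool has a connection: takeLast() returns it immediately.
        System.out.println(borrow(pool, 0));
    }
}
```

With maxWait set, an exhausted pool produces a fast, visible borrow failure instead of an invisible permanent hang, which is why the symptom disappeared.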

Root Cause

Deeper digging with Arthas showed that the two blocked threads were Druid's internal create-connection ("Create") threads. In the test environment Druid ran with its default configuration (maxWait unset, minIdle=0), so the pool kept no idle connections; once a burst of work finished, idle connections were evicted after the default 30-minute idle timeout.

The OpenTelemetry agent intercepted the execute method and acquired a global lock whenever connection metadata was missing. When the pool had no idle connections, the Create thread was woken to create a new one, but creating a connection runs a validation (test) SQL that the agent also intercepted, and that interception needed the same global lock already held by the blocked application thread. The two threads deadlocked, and the pool could never create another connection.
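To make the two-party dependency concrete, here is a minimal, hypothetical Java sketch. A ReentrantLock stands in for the agent's global lock and a CountDownLatch for "a connection became available"; none of these names come from Druid or OpenTelemetry:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    // Stand-in for the OpenTelemetry agent's global lock.
    static final ReentrantLock globalLock = new ReentrantLock();
    // Stand-in for Druid's "connection available" signal.
    static final CountDownLatch connectionReady = new CountDownLatch(1);
    static final CountDownLatch borrowerHasLock = new CountDownLatch(1);

    public static void main(String[] args) throws Exception {
        // Borrower: the instrumented execute() grabs the global lock, then waits for a connection.
        Thread borrower = new Thread(() -> {
            globalLock.lock();
            borrowerHasLock.countDown();
            try {
                connectionReady.await(2, TimeUnit.SECONDS); // takeLast() analogue
            } catch (InterruptedException ignored) {
            } finally {
                globalLock.unlock();
            }
        });

        // Create thread: the validation SQL is also instrumented, so it needs the same lock.
        Thread creator = new Thread(() -> {
            try {
                borrowerHasLock.await();
                if (globalLock.tryLock(500, TimeUnit.MILLISECONDS)) {
                    try {
                        connectionReady.countDown();
                    } finally {
                        globalLock.unlock();
                    }
                } else {
                    // In the real incident neither side had a timeout: permanent deadlock.
                    System.out.println("create thread stuck: global lock held by waiting borrower (deadlock)");
                }
            } catch (InterruptedException ignored) {
            }
        });

        borrower.start();
        creator.start();
        creator.join();
        borrower.join();
    }
}
```

The sketch uses timeouts only so it terminates; remove them and the two threads wait on each other forever, which is exactly what the thread dump showed.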

Solution

Upgrading Druid to a version where the bug is fixed and configuring maxWait appropriately resolved the issue. The relevant GitHub issue and the fixed code (showing computeIfAbsent acquiring a global lock) are included in the original source; the snippet below is the wait logic from Druid's create-connection thread.

// From DruidDataSource.CreateConnectionThread.run(): the create thread takes the
// pool lock, then waits on the `empty` condition until a borrower signals demand.
try {
    lock.lockInterruptibly();
} catch (InterruptedException e2) {
    break;
}

try {
    boolean emptyWait = true;
    if (createError != null && poolingCount == 0 && !discardChanged) {
        // last create attempt failed and nothing was discarded: retry without waiting
        emptyWait = false;
    }
    if (emptyWait && asyncInit && createCount.get() < initialSize) {
        // async init: keep creating until initialSize is reached
        emptyWait = false;
    }
    if (emptyWait) {
        // Only create a connection when a borrower is actually waiting
        if (poolingCount >= notEmptyWaitThreadCount
                && !(keepAlive && activeCount + poolingCount < minIdle)) {
            // wait for the "pool is empty" signal
            empty.await();
        }
        // Prevent creating more than maxActive connections
        if (activeCount + poolingCount >= maxActive) {
            empty.await();
            continue;
        }
    }
} catch (InterruptedException e) {
    lastCreateError = e;
    lastErrorTimeMillis = System.currentTimeMillis();
    if (!closing) {
        LOG.error("create connection Thread Interrupted, url: " + jdbcUrl, e);
    }
    break;
} finally {
    lock.unlock();
}
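As a remediation sketch, an explicit pool configuration that avoids the indefinite takeLast() wait might look like the following. The setter names are real DruidDataSource properties, but the URL, credentials, and chosen values are illustrative assumptions, not the project's actual settings:

```java
import com.alibaba.druid.pool.DruidDataSource;

DruidDataSource ds = new DruidDataSource();
ds.setUrl("jdbc:mysql://localhost:3306/user_center");  // hypothetical URL
ds.setUsername("app");
ds.setPassword("secret");
// With maxWait > 0 the pool uses pollLast(timeout) instead of takeLast(),
// so a borrower fails fast instead of blocking forever.
ds.setMaxWait(6000);   // milliseconds
ds.setMinIdle(5);      // keep warm connections so the Create thread is rarely on the hot path
ds.setMaxActive(20);
ds.setValidationQuery("SELECT 1");
ds.setTestWhileIdle(true);
```

Keeping minIdle above zero also reduces how often the Create thread has to run at all, shrinking the window in which the lock interaction can occur.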
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: debugging, Java, Druid, thread deadlock