Databases 12 min read

Comprehensive Approach to Slow SQL Detection and Governance

The Taobao platform’s slow‑SQL governance team implemented a comprehensive detection and governance pipeline—combining internal slow‑log tools, database slow‑query logs, and JVM‑Sandbox instrumentation to capture full SQL details, scoring high‑risk queries by execution time, scans, and standards violations, then prioritizing remediation through health scores, branch‑diff checks, and issue tracking—significantly cutting DB‑related incidents and boosting system stability.

DaTaobao Tech
DaTaobao Tech
DaTaobao Tech
Comprehensive Approach to Slow SQL Detection and Governance

In the past two years, the team responsible for slow‑SQL governance in the large‑scale Taobao platform faced the challenge of quickly and accurately discovering all slow SQL statements across multiple development and production environments.

Slow SQL is defined as any statement whose execution time exceeds 1 second. Such queries can cause response latency, increased resource consumption, connection pool exhaustion, lock contention and data inconsistency, threatening system stability.

Common causes include missing or ineffective indexes, large table scans, poorly written SQL (excessive joins, sub‑queries, IN lists, deep pagination, ORDER BY, GROUP BY), dirty page flushing, and lock waiting.

Detection methods:

Internal Alibaba tools (TDDL slow‑log, DB service center APIs) for queries longer than 1 s.

Enabling slow‑query logs in the database for external users.

Using JVM‑Sandbox to instrument Java methods that execute SQL, capturing full SQL text, execution time, parameters, DB name, IP, and mapper resources.

The JVM‑Sandbox instrumentation is illustrated by the following code snippet, which records BEFORE, LINE, RETURN and THROWS events:

// BEFORE‑EVENT
try {
    /*
    * do something...
    */
    //LINE‑EVENT
    a();
    // RETURN‑EVENT
    return;
} catch (Throwable cause) {
    // THROWS‑EVENT
}

Collected data are sent via MetaQ, aggregated in real‑time windows, and stored for further analysis.

High‑risk SQL are identified based on:

Violation of corporate SQL standards (e.g., misuse of COUNT, ISNULL, foreign keys, stored procedures).

Full‑table scans (type="ALL"), excessive row scans, filesort operations, or index‑only scans that require back‑table lookup.

Execution time, scan rows, and frequency metrics derived from Druid SQL Parser.

Governance workflow:

Existing (stock) slow SQL are prioritized using a health‑score that considers execution count, rows scanned, and duration, then grouped by application and department.

Incremental slow SQL introduced in development are detected via branch diff against master, linking sql_map changes to the responsible developer, enabling pre‑release gate checks.

Issues for high‑risk statements are created and tracked; overdue issues affect team health scores.

The overall process dramatically reduced DB‑related incidents and improved system stability.

MonitoringPerformanceSQLDatabaseJVM sandboxgovernance
DaTaobao Tech
Written by

DaTaobao Tech

Official account of DaTaobao Technology

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.