Databases 12 min read

How DAS Uses SQL Behavior Analysis to Auto-Detect and Fix Database Anomalies

This article explains how the Database Autonomy Service (DAS) leverages SQL request‑behavior mining, graph‑based clustering and anomaly propagation analysis to automatically locate abnormal SQL, diagnose performance issues, and provide self‑optimizing actions for large‑scale cloud databases.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How DAS Uses SQL Behavior Analysis to Auto-Detect and Fix Database Anomalies

Business Background

DAS (Database Autonomy Service) safeguards the stable operation of millions of database instances by precisely locating abnormal SQL during runtime. Over 90% of database problems stem from anomalous requests, which can degrade performance during high‑traffic events or business changes. Similar to autonomous driving cars perceiving road images, DAS perceives user request behavior to continuously repair and optimize databases.

Problem Description

When CPU usage spikes on an instance, it is easy to find the offending SQL if the total number of SQL templates is small. However, with tens of thousands to billions of templates, the current DAS UI cannot quickly pinpoint abnormal SQL, requiring a more efficient method based on request‑behavior analysis.

Even with DAS Professional SQL Insight, the aggregated SQL templates remain massive, many sharing similar feature trends. Compressing templates by request behavior would make searching clearer.

Challenges & Difficulties

Scale Challenge: Diverse business requests make it hard to locate performance problems across many database engines.

Monitoring & Diagnosis Challenge

Root‑cause identification must be near‑real‑time for SQL‑induced performance issues.

Complex Anomaly Phenomena

Abnormal metrics often involve multiple SQL statements, making single‑SQL explanations insufficient; issues such as full‑table scans, index problems, lock contention, cache breakdowns, and kernel faults can interact.

DAS Solution

Inspiration

Backend applications often enforce idempotency (e.g., retry mechanisms) which can generate repeated SQL statements. Common mitigation includes select‑then‑insert, pessimistic/optimistic locks, distributed locks, or state‑machine‑driven workflows. An example payment‑order state machine demonstrates how a series of related SQL statements (A, B, C) execute serially or in parallel, creating observable trends.

update `bill` set status=1 where id=520 and status=0;
update `bill` set status=2 where id=520 and status=1;
update `bill` set status=3 where id=520 and status=2;

These correlated SQLs allow us to cluster massive SQL data by request‑behavior similarity and rank their contribution to performance metrics.

Process Framework

Perception Process

DAS backend fetches performance data and SQL query data from a unified DataSet Layer. Performance data undergoes multi‑metric anomaly detection and feature extraction.

Abnormal Request Localization

Using graph‑based clustering on templated SQL, the system groups SQLs by request‑behavior features and scores each group’s contribution to metric anomalies (Query Behavior Ranking).

Example: Template set X {sql_a, sql_b, sql_c} → impacts mysql.cpu_usage; Template set Y {sql_i, sql_j, sql_k} → impacts mysql.active_session.

Root‑Cause Analysis

Example chain: Bad template sql_i → lock wait → affects mysql.rows_lock_wait_time → blocks other templates → spikes active sessions → blocks CPU‑intensive SQL → CPU usage spike.

Beyond direct SQL‑metric links, DAS performs Anomaly Propagation Analysis to correlate issues such as replication lag, OOM, or kernel problems with observed SQL behavior, enriching a case library for continuous improvement.

Case Studies Based on Request‑Behavior Identification

Active Session Spike Diagnosis

When active sessions spike, DAS real‑time anomaly detection flags the time window. By selecting the related metric and ranking SQLs by total execution time, the system isolates the SQL whose execution count surged dramatically, causing session accumulation.

Final root cause: a specific SQL’s execution count jumped from ~1,000 to >8,000, slowing other queries and driving the active_session metric spike.

CPU Usage Spike Diagnosis

CPU saturation often correlates with total scanned rows. By analyzing request behavior, DAS links the CPU spike to a full‑table‑scan SQL, confirming the cause.

Final root cause: a full‑table‑scan SQL saturated CPU, leading to session buildup.

Future Plans

DAS will extend real‑time detection and abnormal request localization to more database engines. Professional users will benefit from full‑SQL detail assistance, enabling both DBAs and beginners to manage databases without barriers, achieving true self‑awareness, self‑optimization, and self‑repair for database instances.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

database autonomous drivingquery behavior analysisSQL anomaly detection
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.