How a Single Slow SQL Crashed Our Database: A Real‑World Postmortem
A detailed postmortem shows how a poorly optimized SQL query, triggered by a cache‑removal update, saturated MySQL's CPU and connection pool, causing cascading failures across a sales‑critical system despite low overall traffic.
1. Incident Scene
At 09:49 the application reported error 4103.ERR_ATOM_CONNECTION_POOL_FULL, indicating the DB connection pool was full. Subsequent alerts (4200, 4201, 4202) followed, and the sales workbench became unavailable because the database also handled core user‑organization permissions.
Investigation linked the outage to a cache‑package upgrade that removed local‑cache logic, forcing every request to hit the DB. The upgraded SQL, migrated from Oracle to MySQL without optimization, executed in about 2 seconds per call, creating massive slow‑query load.
SELECT CRM_USER_ID AS LOGIN_ID, CRM_ROLE_ID AS ROLE_NAME, CRM_ORG_ID AS ORG_ID, CONCAT(CRM_USER_ID, '#', CRM_ROLE_ID, '#', CRM_ORG_ID) AS URO, CONCAT(CRM_ORG_ID, '#', CRM_ROLE_ID) AS ORG_ID_ROLE_NAME FROM CRM_USER_ROLE_ORG WHERE IS_DELETED = 'n' AND CONCAT(CRM_ORG_ID, '#', CRM_ROLE_ID) = '123#abc' ORDER BY ID DESC;2. On‑site Conclusions
The upgrade removed local‑memory caching, causing all reads to go to the DB.
The SQL was not optimized for MySQL; the use of CONCAT prevented index usage, turning it into a slow query that overwhelmed the database.
3. Further Questions
Why is this particular SQL slow?
Why did a non‑core application with QPS < 1 crash the database?
Why did expanding the connection pool from 10 to 15 not prevent the collapse?
Why did a similar slow‑SQL during a gray‑release not cause failure?
4. In‑depth Analysis
4.1 Slow SQL Analysis
The query was auto‑translated from Oracle. Although the concatenated columns each have indexes, MySQL cannot use them when wrapped in a function, resulting in a full table scan of over 400 k rows (average 160 k rows scanned, 380 k logical reads) and ~2 s execution time.
4.2 Business Code Investigation
After the incident, some sales pages were still slow. The code that checks organization roles recursively traverses sub‑organizations, potentially invoking the cache/DB 1 000+ times per request. This amplified the slow query to over 15 k executions in a single request, queuing other SQLs and rendering the system unusable.
4.3 Connection Pool Details
The application uses Druid (managed by TDDL) with a maxPoolSize of 15. With seven application servers, the theoretical maximum connections are 7 × 15 = 105, far below MySQL’s 8 000 limit. However, when the slow SQL consumes threads, the server‑side thread pool (≈300 threads) becomes exhausted, triggering the 4103 error.
4.4 MySQL Performance
During the incident the CPU was fully utilized, active sessions surged, and response time (RT) increased. Logical reads dominated (no physical reads, buffer pool hit rate 100%). Each slow query scanned ~160 k rows, causing CPU saturation.
4.5 Why Slow SQL Saturates CPU
InnoDB’s buffer pool holds the entire dataset (≈6 GB < 8 GB instance), so reads are memory‑bound. Massive logical reads cause CPU to spend time scanning rows and sorting, leading to high load despite no disk I/O.
4.6 Database Scaling Considerations
Upgrading the instance (8 cores, 8 GB RAM, 100 GB disk) would not solve the problem; the bottleneck is the unoptimized SQL and the lack of caching. CPU can be expanded up to the physical host’s 64 cores, but without query improvement the issue persists.
5. Summary
Root cause: cache removal forced DB access; the migrated SQL was slow and unindexed; business logic caused massive request amplification.
Fundamental cause: slow SQL saturated MySQL CPU, increased active sessions, and triggered a cascade failure.
Two pools: Druid (application‑level) and MySQL server‑side thread pool; both became exhausted.
Fast diagnosis steps: monitor RT, active sessions, QPS, and logical read metrics; identify high‑cost SQL.
References
MySQL thread pool overview: https://dbaplus.cn/news-11-1989-1.html
Why Vim struggles with large files: https://qastack.cn/unix/139254/why-cant-vim-open-a-100-mb-text-file-when-i-have-16-gb-ram
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
