How a MySQL CPU Spike Exposed Critical Query Mis‑optimizations
An urgent overnight incident revealed a MySQL server’s CPU soaring to 400% due to poorly written queries, prompting a detailed analysis of execution plans, identification of costly operations like filesort and temporary tables, and concrete recommendations for query and team improvements.
Background
A social‑app startup built its backend with Java and MySQL. In early 2023, CPU utilisation on the production database server reached 400% during a high‑traffic weekend, and the service became unresponsive.
Incident Timeline
At 2:00 am the founder reported the CPU saturation. Monitoring screenshots showed the MySQL server at full CPU capacity. The backend team provided the SQL statement in question, which was executed by a frequently called consumer‑facing (C‑end) API endpoint.
Root‑Cause Analysis
Running the SQL on a local replica and examining the EXPLAIN output revealed the following warning flags:
Using temporary
Using filesort
Using join buffer
Block Nested Loop
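These flags indicate that MySQL had to create temporary tables, sort rows in a separate pass because no index supplied the required order (a filesort, which can run in memory or spill to disk), and join tables by buffering rows and scanning the joined table without an index. Block Nested Loop is the algorithm MySQL reports in parentheses after Using join buffer. The execution plan showed full table scans for the WHERE clause and no index usage, which explains the extreme CPU consumption as the dataset approached 100 million rows. The team claimed that the query performed well under 5 million rows and would only need optimisation beyond 100 million rows. That claim does not hold for a plan with no index usage: the cost of unindexed scans and joins grows with every row added, long before any such threshold.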
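The check described above can be automated in a few lines. The following is a minimal sketch, assuming the EXPLAIN result has already been fetched as a list of dictionaries keyed by MySQL's column names (`table`, `type`, `key`, `Extra`). The helper name `find_plan_warnings` and the sample rows are hypothetical and are not the plan from the incident.

```python
# Hypothetical helper: the row layout mirrors the columns that MySQL's
# EXPLAIN prints, but the sample rows below are invented for illustration.
COSTLY_EXTRA = ("Using temporary", "Using filesort", "Using join buffer")


def find_plan_warnings(explain_rows):
    """Return (table, warning) pairs for the costly steps in an EXPLAIN result."""
    warnings = []
    for row in explain_rows:
        table = row.get("table")
        # type=ALL is how EXPLAIN reports a full table scan.
        if row.get("type") == "ALL":
            warnings.append((table, "full table scan"))
        extra = row.get("Extra") or ""
        for flag in COSTLY_EXTRA:
            if flag in extra:
                warnings.append((table, flag))
    return warnings


# Made-up plan with the same flags the article lists.
sample_plan = [
    {"table": "u", "type": "ALL", "key": None,
     "Extra": "Using where; Using temporary; Using filesort"},
    {"table": "p", "type": "ALL", "key": None,
     "Extra": "Using where; Using join buffer (Block Nested Loop)"},
]

for table, warning in find_plan_warnings(sample_plan):
    print(f"{table}: {warning}")
```

A check like this could run in a review or CI step against the plans of new queries, so that a full scan or filesort is noticed before the query reaches production.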
Recommended Optimisations
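Avoid Using filesort by indexing the columns used for filtering and sorting (WHERE, ORDER BY, GROUP BY), and avoid Block Nested Loop by indexing the join columns so that each joined table is reached through an index lookup.
Eliminate Using join buffer and Using temporary by rewriting joins to use indexed columns and by limiting result sets.
Strive for Using index in the execution plan, which means a covering index answers the query without reading the table rows (an index‑only scan) and no full table scan is needed.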
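The effect of these recommendations can be illustrated without a MySQL server. The sketch below uses Python's built‑in `sqlite3` module as a stand‑in, so the plan wording is SQLite's (`SCAN`, `USE TEMP B-TREE FOR ORDER BY`, `COVERING INDEX`) rather than MySQL's `Extra` flags. The table `posts`, its columns, and the index name are assumptions made up for the example. The principle carries over: a composite index on the filter column followed by the sort column removes both the full scan and the separate sort.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's query plan for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts (user_id, created_at, body) VALUES (?, ?, ?)",
    [(i % 50, f"2023-01-{i % 28 + 1:02d}", "x") for i in range(1000)],
)

# Hypothetical feed query: filter on one column, sort on another.
QUERY = (
    "SELECT created_at FROM posts "
    "WHERE user_id = ? ORDER BY created_at DESC LIMIT 20"
)

plan_before = plan(conn, QUERY, (7,))

# Composite index: filter column first, then the sort column.
conn.execute("CREATE INDEX idx_posts_user_created ON posts (user_id, created_at)")

plan_after = plan(conn, QUERY, (7,))

print("before:", plan_before)
print("after: ", plan_after)
```

In MySQL the corresponding experiment would be to run `EXPLAIN` on the query before and after adding the composite index and to compare the `type`, `key`, and `Extra` columns.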
Mitigation Steps Taken
Rolled back the newly deployed feature that introduced the problematic query.
Isolated the offending SQL, rewrote it to use indexed columns, and removed unnecessary functions from the SELECT and WHERE clauses.
Adjusted related business logic to match the revised query semantics.
Performed functional testing to verify correctness.
Conducted load testing with realistic data volumes (up to 100 million rows) to confirm that CPU usage remained within acceptable limits before redeployment.
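The rewrite step in the list above, removing functions from the WHERE clause, matters because wrapping an indexed column in a function usually prevents the index from being used for lookups. The following is a minimal sketch of that effect, again using Python's built‑in `sqlite3` module as a stand‑in for MySQL. The `events` table and its index are hypothetical and are not the query from the incident.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_events_created ON events (created_at)")
conn.executemany(
    "INSERT INTO events (created_at) VALUES (?)",
    [(f"2023-01-{day:02d} 12:00:00",) for day in range(1, 29)],
)


def first_plan_step(sql, params):
    """Return the first step of SQLite's query plan for `sql`."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchone()[3]


# Function applied to the indexed column: every row must be evaluated.
WRAPPED = "SELECT id FROM events WHERE date(created_at) = ?"
wrapped_args = ("2023-01-05",)

# Same result expressed as a range on the bare column: an index range lookup.
RANGED = "SELECT id FROM events WHERE created_at >= ? AND created_at < ?"
ranged_args = ("2023-01-05 00:00:00", "2023-01-06 00:00:00")

wrapped_plan = first_plan_step(WRAPPED, wrapped_args)
ranged_plan = first_plan_step(RANGED, ranged_args)

wrapped_rows = conn.execute(WRAPPED, wrapped_args).fetchall()
ranged_rows = conn.execute(RANGED, ranged_args).fetchall()

print("function in WHERE:", wrapped_plan)
print("range on column:  ", ranged_plan)
```

Both queries return the same rows, which is the property the functional testing step needs to confirm after any such rewrite.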
Post‑mortem Insights
The incident demonstrates that immediate service restoration must be followed by systematic diagnosis: capture monitoring data, reproduce the query on a test replica, and analyse the execution plan. Misconceptions about MySQL scaling thresholds (e.g., “optimisation is only needed after 100 million rows”) can lead to severe performance degradation. Proper query design, comprehensive indexing, and performance testing at production‑scale data sizes are essential for reliable backend services.
dbaplus Community
