How to Diagnose and Fix AWS RDS MySQL CPU Spikes with Performance Insights
This guide walks you through a step‑by‑step process for identifying the root cause of sudden 100% CPU usage on AWS RDS MySQL instances using Performance Insights, SHOW FULL PROCESSLIST, and Performance Schema, and provides practical tips for remediation.
First Stop: AWS Performance Insights – The Primary Responder
When a CPU alarm fires, the first place to look is the Performance Insights tab in the RDS console. It visualizes database load and lets you slice metrics by SQL, wait events, host, and user, making it ideal for pinpointing the offending user.
Open Performance Insights: Log into the AWS console, navigate to RDS, select the problematic instance, and click the "Performance Insights" tab.
Select the time window: Use the time selector at the top to highlight the period when CPU spiked.
Analyze database load: The main chart shows Average Active Sessions (AAS), the average number of sessions that were actively running or waiting during each interval. If the chart is mostly green (the CPU wait state) and AAS sits at or above the Max vCPU line, the load is CPU-bound.
Slice by user: In the "Top dimensions" area, change the "Slice by" dimension from the default "Waits" to Users.
Example: After slicing by user, you may see user_app consuming about 80% of the load, while user_admin and rdsadmin are negligible.
Conclusion: The culprit is user_app. Once identified, filter by this user and switch the dimension to "SQL"; the top-ranking queries will reveal the exact statement driving the CPU spike.
Practical tip: Regularly review Performance Insights even before problems arise to catch potential bottlenecks early.
Second Stop: SHOW FULL PROCESSLIST – Real‑time Snapshot
If Performance Insights lacks detail (e.g., a newly created instance with insufficient data) or you need an immediate view, run the classic SHOW FULL PROCESSLIST command.
SHOW FULL PROCESSLIST;
Key columns to examine:
User: the database user that owns the connection.
Host: the client host name or IP address (with port) of the connection.
db: the default database of the session, if one is selected.
Command: the type of command the thread is executing (e.g., Query, Sleep, Connect); the separate State column describes what the thread is currently doing.
Time: seconds the thread has spent in its current state. For a Query thread, a large value indicates a long-running statement; for a Sleep thread, it is only idle time.
Info: the full SQL statement being executed; NULL means the thread is not running a statement.
Practical tip: During CPU peaks, run SHOW FULL PROCESSLIST several times in succession. If the same user keeps showing a query with a growing Time value, that user is likely the source of the problem. Note that this command only captures a snapshot and may miss very fast, high-frequency queries.
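The same snapshot can be filtered and aggregated by querying it as a table. The following is a minimal sketch, assuming MySQL 5.7 or 8.0, where information_schema.PROCESSLIST is available (it is deprecated in 8.0 in favor of performance_schema.processlist). The 5-second threshold and the 200-character truncation are arbitrary illustration values.

```sql
-- Non-idle sessions that have been in their current state for a while,
-- longest-running first. The 5-second cutoff is an assumed threshold.
SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE,
       LEFT(INFO, 200) AS query_text
FROM information_schema.PROCESSLIST
WHERE COMMAND <> 'Sleep'
  AND TIME >= 5
ORDER BY TIME DESC;

-- Number of non-idle sessions per user at this instant.
SELECT USER,
       COUNT(*)  AS active_sessions,
       MAX(TIME) AS longest_seconds
FROM information_schema.PROCESSLIST
WHERE COMMAND NOT IN ('Sleep', 'Daemon', 'Binlog Dump')
GROUP BY USER
ORDER BY active_sessions DESC;
```

Running the second query a few times during the spike gives a rough, manual approximation of the per-user load view, with the same snapshot caveat as SHOW FULL PROCESSLIST.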
Third Stop: Performance Schema – Forensic‑level Analysis
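When the issue is harder to pin down, combine Performance Insights with SHOW FULL PROCESSLIST, or dive into the Performance Schema, MySQL's built-in, low-overhead instrumentation feature that records detailed event data.
Ensure the performance_schema parameter is enabled (set to 1) in your RDS parameter group. It is a static parameter, so a change takes effect only after the instance is rebooted.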
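Before relying on the summary tables, it can help to confirm from a SQL session that the Performance Schema is actually active. A small sketch, using only standard MySQL system variables and setup tables:

```sql
-- Returns 1 when the Performance Schema is enabled on this instance.
SELECT @@performance_schema;

-- Shows which statement-related consumers are collecting data.
SELECT NAME, ENABLED
FROM performance_schema.setup_consumers
WHERE NAME LIKE 'events_statements%'
   OR NAME = 'statements_digest';
```

If the first query returns 0, the summary tables queried in this section will be empty.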
Query to identify heavy users:
-- This query aggregates total execution time and query count per user since the last reset
-- SUM_TIMER_WAIT is measured in picoseconds
SELECT
    USER,
    SUM(COUNT_STAR) AS total_queries,
    SUM(SUM_TIMER_WAIT) / 1000000000000 AS total_execution_time_seconds
FROM performance_schema.events_statements_summary_by_user_by_event_name
WHERE EVENT_NAME LIKE 'statement/sql/%' AND USER IS NOT NULL
GROUP BY USER
ORDER BY total_execution_time_seconds DESC;
The result lists users ordered by total execution time, allowing you to spot the most resource-hungry accounts.
Practical tip: Because Performance Schema data accumulates, you can truncate the summary table before a test (TRUNCATE TABLE performance_schema.events_statements_summary_by_user_by_event_name;) and run the query after reproducing the issue for precise measurements.
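Once the heavy user is known, the next question is usually which statement pattern is responsible. One possible follow-up, sketched here as an assumption rather than a step from the original workflow, is to rank normalized statements in events_statements_summary_by_digest. This table is aggregated per schema and digest and has no user column, so the result covers all users.

```sql
-- Top 10 statement patterns by total execution time since the last reset.
-- Timer columns are in picoseconds, hence the division by 10^12.
SELECT
    SCHEMA_NAME,
    LEFT(DIGEST_TEXT, 120)                   AS statement_pattern,
    COUNT_STAR                               AS executions,
    ROUND(SUM_TIMER_WAIT / 1000000000000, 2) AS total_seconds,
    ROUND(AVG_TIMER_WAIT / 1000000000000, 4) AS avg_seconds,
    SUM_ROWS_EXAMINED                        AS rows_examined,
    SUM_NO_INDEX_USED                        AS executions_without_index
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

A pattern with a high rows_examined count relative to its executions, or a non-zero executions_without_index, is a likely candidate for index or query tuning.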
Summary and Action Plan
Prefer Performance Insights: It provides the quickest visual identification of the offending user and query.
Supplement with SHOW FULL PROCESSLIST: Use it for real-time observation of long-running queries.
Deep dive with Performance Schema: Run aggregated queries to quantify each user's resource consumption when the problem is more complex.
After locating the problematic user and SQL statement, optimize by adding indexes, rewriting the query, or reducing call frequency. Precise identification is the first step toward effective remediation.
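As an illustration of the remediation step, the sketch below uses a hypothetical table orders with a column customer_id, and a placeholder process ID; none of these names come from the original incident. It assumes the slow statement filters on an unindexed column.

```sql
-- Inspect the execution plan of the suspect statement.
-- "type: ALL" in the output indicates a full table scan.
EXPLAIN
SELECT order_id, total
FROM orders                     -- hypothetical table
WHERE customer_id = 42;         -- hypothetical filter column

-- If the plan shows a full scan, an index on the filter column may help.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- To stop a runaway statement while a fix is prepared, RDS provides a
-- stored procedure. Replace 12345 (placeholder) with the Id value shown
-- by SHOW FULL PROCESSLIST.
CALL mysql.rds_kill_query(12345);
```

Re-running EXPLAIN after creating the index shows whether the optimizer now uses it. Building an index on a large table generates load of its own, so it may be better scheduled outside the peak.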
Ops Development & AI Practice
DevSecOps engineer sharing experiences and insights on AI, Web3, and Claude code development. Aims to help solve technical challenges, improve development efficiency, and grow through community interaction. Feel free to comment and discuss.
