How to Diagnose MySQL Lock Waits and Transaction Timeouts with MyAWR
This article explains how commercial banks can analyze MySQL lock‑wait events and transaction timeouts using MDL metadata lock monitoring, innodb_lock_wait_timeout, custom MyAWR collection, SQL de‑parameterization, and post‑mortem queries to pinpoint blocking SQL and its source.
1. Problem Background
Commercial banks often set transaction timeouts (e.g., 8 s or 4 s). Unoptimized SQL causing large transactions or DDL can hold record‑level or system‑level exclusive locks, blocking online transactions and triggering timeout alerts.
Although most IT operation platforms monitor MDL metadata locks and MySQL lock‑wait timeout parameters, the configured values usually exceed application‑level timeouts, leaving a gap in database‑level monitoring and post‑mortem analysis when many transaction timeouts occur.
2. Current Situation
The center currently provides the following monitoring and governance measures:
MDL metadata lock monitoring with a 15 s threshold;
MySQL
innodb_lock_wait_timeoutparameter set to 10 s;
Large‑transaction kill mechanism;
Slow‑SQL automatic kill mechanism;
The internally developed MyAWR tool can also provide minute‑level MySQL instance runtime information.
3. Problem Analysis and Action Points
Given the existing monitoring, adding extra real‑time lock‑blocking alerts is not considered necessary because:
Existing metadata lock and row‑lock timeout alerts are too short for manual intervention, and automated handling is uncertain.
Automatic handling accuracy is unclear (e.g., can the blocker be safely killed?).
Row‑lock timeout will end the waiting session, causing the application to retry, which aligns with the processing flow.
Collecting finer‑grained data at the second level may impact production performance.
The article therefore focuses on post‑mortem analysis: when many transaction timeouts occur, systematically identify the main sources of lock‑induced blocking.
In OLTP systems, many locks are generated continuously, but only locks that coincide with high concurrency, improper locking, or hot‑resource contention cause blocking. Extreme cases can lead to lock‑induced avalanches, exhausting connection pools or maxing CPU, though existing monitoring reduces this risk.
The primary goal is to use historical run records to quickly pinpoint the blocking SQL statements and their origins.
4. Analysis Method
All examples use MySQL; other databases (GaussDB, PostgreSQL) are analogous.
1. Collect Blocking Information
Based on the MyAWR tool’s lock‑wait record table:
<code>-- myawr_db_innodb_lock_waits definition
CREATE TABLE `myawr_db_innodb_lock_waits` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`snap_id` int(11) NOT NULL,
`wait_started` varchar(19) DEFAULT NULL,
`wait_age` varchar(8) DEFAULT NULL,
`wait_age_secs` bigint(21) DEFAULT NULL,
`locked_table` varchar(1024) DEFAULT NULL,
`locked_index` varchar(1024) DEFAULT NULL,
`locked_type` varchar(32) DEFAULT NULL,
`waiting_trx_id` varchar(18) DEFAULT NULL,
`waiting_trx_started` varchar(19) DEFAULT NULL,
`waiting_trx_age` varchar(8) DEFAULT NULL,
`waiting_trx_rows_locked` bigint(21) unsigned DEFAULT NULL,
`waiting_trx_rows_modified` bigint(21) unsigned DEFAULT NULL,
`waiting_pid` bigint(21) unsigned DEFAULT NULL,
`waiting_query` varchar(1024) DEFAULT NULL,
`waiting_lock_id` varchar(81) DEFAULT NULL,
`waiting_lock_mode` varchar(32) DEFAULT NULL,
`blocking_trx_id` varchar(18) DEFAULT NULL,
`blocking_pid` bigint(21) unsigned DEFAULT NULL,
`blocking_query` varchar(1024) DEFAULT NULL,
`blocking_lock_id` varchar(81) DEFAULT NULL,
`blocking_lock_mode` varchar(32) DEFAULT NULL,
`blocking_trx_started` varchar(19) DEFAULT NULL,
`blocking_trx_age` varchar(8) DEFAULT NULL,
`blocking_trx_rows_locked` bigint(21) unsigned DEFAULT NULL,
`blocking_trx_rows_modified` bigint(21) unsigned DEFAULT NULL,
`sql_kill_blocking_query` varchar(32) DEFAULT NULL,
`sql_kill_blocking_connection` varchar(26) DEFAULT NULL,
`m` int(2) NOT NULL,
PRIMARY KEY (`id`,`m`),
KEY `idx_db_innodb_lock_waits` (`snap_id`)
) ENGINE=InnoDB AUTO_INCREMENT=26448 DEFAULT CHARSET=utf8mb4;
</code>Sample collection query (compatible with MySQL 5.7 and 8.0):
<code>select
DATE_FORMAT(`wait_started`,'%Y-%m-%d %H:%i:%s'),
TIME_FORMAT(`wait_age`,'%H:%i:%s'),
`wait_age_secs`,
`locked_table`,
`locked_index`,
`locked_type`,
`waiting_trx_id`,
DATE_FORMAT(`waiting_trx_started`,'%Y-%m-%d %H:%i:%s'),
TIME_FORMAT(`waiting_trx_age`,'%H:%i:%s'),
`waiting_trx_rows_locked`,
`waiting_trx_rows_modified`,
`waiting_pid`,
`waiting_query`,
`waiting_lock_id`,
`waiting_lock_mode`,
`blocking_trx_id`,
`blocking_pid`,
IFNULL(`blocking_query`,''),
`blocking_lock_id`,
`blocking_lock_mode`,
DATE_FORMAT(`blocking_trx_started`,'%Y-%m-%d %H:%i:%s'),
TIME_FORMAT(`blocking_trx_age`,'%H:%i:%s'),
`blocking_trx_rows_locked`,
`blocking_trx_rows_modified`,
`sql_kill_blocking_query`,
`sql_kill_blocking_connection`
from sys.x$innodb_lock_waits
limit 500;
</code>Note: The tool samples every minute, so second‑level precision may be limited; adjust the
limitclause if many lock‑wait events occur.
2. SQL De‑parameterization
The collected
waiting_queryand
blocking_querycontain full SQL with concrete parameters. To group similar statements, replace literals with placeholders:
<code>UPDATE warehouse SET w_ytd = w_ytd + 1876.54 WHERE w_id = 1
UPDATE warehouse SET w_ytd = w_ytd + 3384.41 WHERE w_id = 3
</code>Desired normalized form:
<code>UPDATE warehouse SET w_ytd = w_ytd + ? WHERE w_id = ?
</code>A simple Go implementation replaces quoted strings and numeric literals with “?”:
<code>func formatSQL(sql string) string {
sarr := strings.Split(sql, "'")
for i := 1; i < len(sarr); i += 2 {
sarr[i] = "?"
}
aa := strings.Join(sarr, "'")
re := regexp.MustCompile(`\b\d+(\.\d+)?\b`)
formattedSQL := re.ReplaceAllString(aa, "?")
return formattedSQL
}
</code>Result illustration:
3. Table‑Level Blocking Analysis
Query the lock‑wait table for a specific time window to aggregate wait time per table and blocking SQL:
<code>select
w.locked_table,
sum(w.wait_age_secs) as wait_sum,
w.blocking_query
from myawr_db_innodb_lock_waits w
where w.wait_started between '2023-08-22 16:41:00' and '2023-08-22 17:00:00'
group by w.locked_table, w.blocking_query
order by wait_sum desc;
</code>Additional dimensions can be explored using other fields in the lock‑wait detail table.
Limitations:
The collection is a sample; lower sampling rates may miss events, but captured records still indicate important issues.
The
blocking_queryfield may not always be the exact statement causing the lock because locks are held per transaction, and the sampled statement could be a later one (e.g., COMMIT).
Summing
wait_age_secsdoes not reflect total blocked time accurately due to sampling intervals and possible duplicate records for the same lock.
Thus, the analysis provides qualitative insights to identify the most problematic tables and SQL during a given period.
4. Correlating Blocking SQL with Application Nodes
Using MyAWR’s historical session table, match the blocking SQL prefix to the
userfield to retrieve IP or host information, helping locate the offending application node.
<code>-- mytest.myawr_db_session definition
CREATE TABLE `myawr_db_session` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`snap_id` int(11) NOT NULL,
`thd_id` bigint(20) unsigned DEFAULT NULL,
`conn_id` bigint(20) unsigned DEFAULT NULL,
`user` varchar(128) DEFAULT NULL,
`db` varchar(64) DEFAULT NULL,
`command` varchar(16) DEFAULT NULL,
`state` varchar(64) DEFAULT NULL,
`time` bigint(20) DEFAULT NULL,
`current_statement` longtext,
`statement_latency` bigint(20) unsigned DEFAULT NULL,
`progress` decimal(26,0) DEFAULT NULL,
`lock_latency` bigint(20) unsigned DEFAULT NULL,
`rows_examined` bigint(20) unsigned DEFAULT NULL,
`rows_sent` bigint(20) unsigned DEFAULT NULL,
`rows_affected` bigint(20) unsigned DEFAULT NULL,
`tmp_tables` bigint(20) unsigned DEFAULT NULL,
`tmp_disk_tables` bigint(20) unsigned DEFAULT NULL,
`full_scan` varchar(3) DEFAULT NULL,
`last_statement` longtext,
`last_statement_latency` bigint(20) unsigned DEFAULT NULL,
`current_memory` decimal(41,0) DEFAULT NULL,
`last_wait` varchar(128) DEFAULT NULL,
`last_wait_latency` varchar(20) DEFAULT NULL,
`source` varchar(64) DEFAULT NULL,
`trx_latency` bigint(20) unsigned DEFAULT NULL,
`trx_state` varchar(20) DEFAULT NULL,
`trx_autocommit` varchar(20) DEFAULT NULL,
`pid` varchar(1024) DEFAULT NULL,
`program_name` varchar(1024) DEFAULT NULL,
`m` int(2) NOT NULL,
PRIMARY KEY (`id`,`m`),
KEY `idx_db_session` (`snap_id`)
) ENGINE=InnoDB AUTO_INCREMENT=6317 DEFAULT CHARSET=utf8mb4;
</code>5. Conclusion
The article presents a post‑mortem approach for analyzing lock‑blocking events that impact business, focusing on extracting useful insights from historical data rather than real‑time emergency handling. The same methodology can be adapted to other database platforms.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.