Root Cause Analysis of MySQL OOM in Tencent Cloud CDB Instances
This article details a comprehensive investigation of MySQL out‑of‑memory (OOM) alerts on a Tencent Cloud CDB instance during a high‑traffic marketing event, covering background, observed symptoms, step‑by‑step debugging, analysis of server and engine memory usage, relevant MySQL bugs, fetch_cache behavior, and the final configuration recommendations to prevent OOM.
Authors: Zhang Yuan (Tencent CDB Senior Engineer) and Yu Chengzhen (Weimeng DBA Lead).
Background: Weimeng provides cloud‑based commercial and marketing solutions for SMEs. On July 17, a client’s "Super Marketing Day" required flawless service, but several MySQL OOM alerts were received from a Tencent Cloud CDB instance. The DBA team handed the issue to Tencent’s DBA experts, who collaborated with the CDB team to reproduce and investigate the problem.
Phenomenon
During concurrent execution of a series of SQL statements, memory usage continuously grew, exceeding the instance’s 15 GB limit and eventually causing OOM.
Investigation Process
The Tencent CDB team took over the case, suspecting SQL issues. They enabled database audit, excluded special SQL, turned on performance_schema, and began source‑code analysis.
3.1 Server‑Level Memory Analysis
Using the global status variable total_server_memory_used, the server‑level memory was only about 4 GB, showing no abnormality.
Thread‑level memory from SHOW FULL PROCESSLIST also showed no issues.
3.2 Offline SQL Analysis
All audited SQL statements were simple, without joins, and used indexes appropriately. Individual and concurrent tests did not reproduce memory spikes.
3.3 Official MySQL OOM‑Related Bugs
Two bugs from MySQL 5.7 were examined:
InnoDB: A dangling pointer caused a memory leak (Bug #28693568).
A query using a dynamic range and index‑merge could allocate excessive memory (Bug #89953, Bug #27659490).
Bug #28693568 is a memory‑leak in the embedded server and does not affect CDB 5.7. Bug #27659490 concerns index‑merge memory usage, but our workload does not use index‑merge, and the memory increase observed is on the engine layer, not the server layer.
Relevant commit excerpts:
commit 2a07e8d69f35a94fb0133011e3ea84ca6072171d
Author: Tor Didriksen <[email protected]>
Date: Mon Sep 24 13:16:23 2018 +0200
Bug#28693568 MEMORY LEAK IN EMBEDDED SERVER
Backport patch from 8.0 to avoid dangling char* pointer for filepath.
Change-Id: I049dceaec9103b9ba93e58732b92c412459ead7a commit 78e067e69ab7afa9f02b03e1c3f398e4da4d7377
Author: Kailasnath Nagarkar <[email protected]>
Date: Thu Aug 30 17:19:34 2018 +0530
Bug #27659490 : SELECT USING DYNAMIC RANGE AND INDEX MERGE USE TOO MUCH MEMORY(OOM)
Issue:
While creating a handler object, index‑merge access creates it in statement MEM_ROOT.
When used with "Dynamic range access method", the range optimizer is invoked multiple times, causing MySQL to consume a lot of memory.
Solution:
Instead of using statement MEM_ROOT to allocate the handler object, use the local MEM_ROOT of the range optimizer, which is destroyed at the end of the optimizer's usage.3.4 Enabling performance_schema
Turning on performance_schema incurs ~3 % performance overhead but provides detailed memory usage. After enabling, the largest memory consumers (besides buffer_pool) were memory/innodb/row0sel and memory/innodb/mem0mem.
3.5 Deep Dive into Source Code
Analysis focused on the memory allocation paths for memory/innodb/row0sel and memory/innodb/mem0mem.
3.5.1 memory/innodb/row0sel
This component allocates memory in four places, two of which are related to the fetch cache. The fetch cache stores rows fetched from the InnoDB engine to reduce server‑engine round‑trips.
Key code snippets:
sel_col_prefetch_buf_alloc
column->prefetch_buf = static_cast<sel_buf_t*>(
ut_malloc_nokey(SEL_MAX_N_PREFETCH * sizeof(sel_buf_t)));
row_sel_prefetch_cache_init
sz = UT_ARR_SIZE(prebuilt->fetch_cache) * (prebuilt->mysql_row_len + 8);
ptr = static_cast<byte*>(ut_malloc_nokey(sz));
row_search_mvcc
if (end_range_cache == NULL) {
end_range_cache = static_cast<byte*>(
ut_malloc_nokey(prebuilt->mysql_row_len));
}
row_count_rtree_recs
ulint bufsize = ut_max(UNIV_PAGE_SIZE, prebuilt->mysql_row_len);
buf = static_cast<byte*>(ut_malloc_nokey(bufsize));The fetch cache size for a table is calculated as (prebuilt->mysql_row_len + 8) * MYSQL_FETCH_CACHE_SIZE. For a table with mysql_row_len = 11458, the cache size is ~92 KB; for a larger table with mysql_row_len = 58565, it is ~548 KB.
3.5.2 memory/innodb/mem0mem
This component is the foundation for many engine‑level allocations, primarily via mem_strdup/mem_strdupl and mem_heap_create_block_func.
3.6 Index Condition Pushdown (ICP) & Fetch Cache
Stack traces showed that enabling ICP triggers fetch‑cache allocation. Disabling ICP by setting optimizer_switch='index_condition_pushdown=off' dramatically reduced memory usage without noticeable performance impact.
3.7 Root Cause
The fetch cache is allocated per row_prebuild_t object, which exists for each opened table. The total number of fetch caches equals
min(table_open_cache, table_open_cache_instances * total_table_count). In the examined instance, table_open_cache = 102400 and table_open_cache_instances = 64, with roughly 2000 tables, yielding a potential maximum of 52 GB of fetch‑cache memory.
Additionally, each table’s row_prebuild_t allocates ~7 KB from mem0mem, contributing another ~680 MB.
3.8 Recommendations
Reduce table_open_cache to 10240 and set table_open_cache_instances to 8–16. This limits the number of fetch caches and prevents the memory from exceeding the instance’s quota, thereby avoiding OOM.
Conclusion
MySQL OOM in CDB instances can stem from various sources: MySQL bugs (e.g., information_schema queries on many tables), user‑level patterns (e.g., massive UNIONs causing parser overflow), and configuration settings that amplify engine‑level memory consumption. Tencent Cloud CDB offers multiple diagnostic tools—server‑level memory stats, performance_schema, and audit logs—to pinpoint issues. The CDB team remains the strongest line of defense for cloud‑based workloads.
Tencent Database Technology
Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
