Databases 14 min read

Root Cause Analysis of MySQL OOM in Tencent Cloud CDB Instances

This article details a comprehensive investigation of MySQL out‑of‑memory (OOM) alerts on a Tencent Cloud CDB instance during a high‑traffic marketing event, covering background, observed symptoms, step‑by‑step debugging, analysis of server and engine memory usage, relevant MySQL bugs, fetch_cache behavior, and the final configuration recommendations to prevent OOM.

Tencent Database Technology
Tencent Database Technology
Tencent Database Technology
Root Cause Analysis of MySQL OOM in Tencent Cloud CDB Instances

Authors: Zhang Yuan (Tencent CDB Senior Engineer) and Yu Chengzhen (Weimeng DBA Lead).

Background: Weimeng provides cloud‑based commercial and marketing solutions for SMEs. On July 17, a client’s "Super Marketing Day" required flawless service, but several MySQL OOM alerts were received from a Tencent Cloud CDB instance. The DBA team handed the issue to Tencent’s DBA experts, who collaborated with the CDB team to reproduce and investigate the problem.

Phenomenon

During concurrent execution of a series of SQL statements, memory usage continuously grew, exceeding the instance’s 15 GB limit and eventually causing OOM.

Memory usage chart
Memory usage chart
OOM occurrence
OOM occurrence

Investigation Process

The Tencent CDB team took over the case, suspecting SQL issues. They enabled database audit, excluded special SQL, turned on performance_schema, and began source‑code analysis.

3.1 Server‑Level Memory Analysis

Using the global status variable total_server_memory_used, the server‑level memory was only about 4 GB, showing no abnormality.

Server memory breakdown
Server memory breakdown

Thread‑level memory from SHOW FULL PROCESSLIST also showed no issues.

Thread memory usage
Thread memory usage

3.2 Offline SQL Analysis

All audited SQL statements were simple, without joins, and used indexes appropriately. Individual and concurrent tests did not reproduce memory spikes.

3.3 Official MySQL OOM‑Related Bugs

Two bugs from MySQL 5.7 were examined:

InnoDB: A dangling pointer caused a memory leak (Bug #28693568).

A query using a dynamic range and index‑merge could allocate excessive memory (Bug #89953, Bug #27659490).

Bug #28693568 is a memory‑leak in the embedded server and does not affect CDB 5.7. Bug #27659490 concerns index‑merge memory usage, but our workload does not use index‑merge, and the memory increase observed is on the engine layer, not the server layer.

Relevant commit excerpts:

commit 2a07e8d69f35a94fb0133011e3ea84ca6072171d
Author: Tor Didriksen <[email protected]>
Date:   Mon Sep 24 13:16:23 2018 +0200

Bug#28693568 MEMORY LEAK IN EMBEDDED SERVER
Backport patch from 8.0 to avoid dangling char* pointer for filepath.
Change-Id: I049dceaec9103b9ba93e58732b92c412459ead7a
commit 78e067e69ab7afa9f02b03e1c3f398e4da4d7377
Author: Kailasnath Nagarkar <[email protected]>
Date:   Thu Aug 30 17:19:34 2018 +0530

Bug #27659490 : SELECT USING DYNAMIC RANGE AND INDEX MERGE USE TOO MUCH MEMORY(OOM)

Issue:
While creating a handler object, index‑merge access creates it in statement MEM_ROOT.
When used with "Dynamic range access method", the range optimizer is invoked multiple times, causing MySQL to consume a lot of memory.

Solution:
Instead of using statement MEM_ROOT to allocate the handler object, use the local MEM_ROOT of the range optimizer, which is destroyed at the end of the optimizer's usage.

3.4 Enabling performance_schema

Turning on performance_schema incurs ~3 % performance overhead but provides detailed memory usage. After enabling, the largest memory consumers (besides buffer_pool) were memory/innodb/row0sel and memory/innodb/mem0mem.

Performance schema memory
Performance schema memory

3.5 Deep Dive into Source Code

Analysis focused on the memory allocation paths for memory/innodb/row0sel and memory/innodb/mem0mem.

3.5.1 memory/innodb/row0sel

This component allocates memory in four places, two of which are related to the fetch cache. The fetch cache stores rows fetched from the InnoDB engine to reduce server‑engine round‑trips.

Key code snippets:

sel_col_prefetch_buf_alloc
column->prefetch_buf = static_cast<sel_buf_t*>(
    ut_malloc_nokey(SEL_MAX_N_PREFETCH * sizeof(sel_buf_t)));

row_sel_prefetch_cache_init
sz = UT_ARR_SIZE(prebuilt->fetch_cache) * (prebuilt->mysql_row_len + 8);
ptr = static_cast<byte*>(ut_malloc_nokey(sz));

row_search_mvcc
if (end_range_cache == NULL) {
    end_range_cache = static_cast<byte*>(
        ut_malloc_nokey(prebuilt->mysql_row_len));
}

row_count_rtree_recs
ulint bufsize = ut_max(UNIV_PAGE_SIZE, prebuilt->mysql_row_len);
buf = static_cast<byte*>(ut_malloc_nokey(bufsize));

The fetch cache size for a table is calculated as (prebuilt->mysql_row_len + 8) * MYSQL_FETCH_CACHE_SIZE. For a table with mysql_row_len = 11458, the cache size is ~92 KB; for a larger table with mysql_row_len = 58565, it is ~548 KB.

3.5.2 memory/innodb/mem0mem

This component is the foundation for many engine‑level allocations, primarily via mem_strdup/mem_strdupl and mem_heap_create_block_func.

3.6 Index Condition Pushdown (ICP) & Fetch Cache

Stack traces showed that enabling ICP triggers fetch‑cache allocation. Disabling ICP by setting optimizer_switch='index_condition_pushdown=off' dramatically reduced memory usage without noticeable performance impact.

ICP impact
ICP impact

3.7 Root Cause

The fetch cache is allocated per row_prebuild_t object, which exists for each opened table. The total number of fetch caches equals

min(table_open_cache, table_open_cache_instances * total_table_count)

. In the examined instance, table_open_cache = 102400 and table_open_cache_instances = 64, with roughly 2000 tables, yielding a potential maximum of 52 GB of fetch‑cache memory.

Additionally, each table’s row_prebuild_t allocates ~7 KB from mem0mem, contributing another ~680 MB.

3.8 Recommendations

Reduce table_open_cache to 10240 and set table_open_cache_instances to 8–16. This limits the number of fetch caches and prevents the memory from exceeding the instance’s quota, thereby avoiding OOM.

Recommended configuration
Recommended configuration

Conclusion

MySQL OOM in CDB instances can stem from various sources: MySQL bugs (e.g., information_schema queries on many tables), user‑level patterns (e.g., massive UNIONs causing parser overflow), and configuration settings that amplify engine‑level memory consumption. Tencent Cloud CDB offers multiple diagnostic tools—server‑level memory stats, performance_schema, and audit logs—to pinpoint issues. The CDB team remains the strongest line of defense for cloud‑based workloads.

MySQLDatabase OptimizationPerformance SchemaOOMTencent Cloudfetch_cache
Tencent Database Technology
Written by

Tencent Database Technology

Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.