Databases 17 min read

How ICBC Scales MySQL to 10,000 Nodes: Governance, Automation, and Performance Strategies

This article outlines Industrial and Commercial Bank of China's journey from early MySQL adoption to managing nearly ten thousand production nodes, detailing the challenges of high‑capacity, high‑concurrency workloads and the comprehensive governance framework—including standards, pre‑emptive checks, automated incident response, and future self‑healing plans—used to ensure reliable core‑application data storage.

ITPUB

Jan 8, 2021

How ICBC Scales MySQL to 10,000 Nodes: Governance, Automation, and Performance Strategies

Abstract

This presentation, based on the 2020 DTCC Database Conference, shares ICBC's experience in governing MySQL for core business data storage. Since 2014, the bank has expanded MySQL usage from peripheral low‑level applications to core high‑level services, now operating close to ten thousand MySQL nodes in production.

Speaker Introduction

Lin Zhenxi, a MySQL specialist at ICBC since 2014, introduced MySQL 5.5 and 5.7, and is responsible for MySQL architecture design, development standards, and related technical training.

Current Status and Challenges

From June 2019 to the present, the number of MySQL nodes grew dramatically, with a large proportion serving core (A‑class) applications. The bank classifies applications into four levels (A‑D), where A‑class denotes core services. High‑capacity, high‑concurrency business demands have driven rapid expansion of distributed database architectures.

Operational pressure is intense: monitoring, alerting, and fault recovery become difficult without automation. The bank co‑developed a MySQL management platform with a third‑party vendor to automate fault diagnosis and switchover, achieving fast recovery with minimal business impact.

Physical‑machine deployments waste CPU resources because MySQL typically uses low CPU. To improve utilization, about 90% of MySQL instances have been containerized, boosting CPU efficiency by 4‑5×.

Governance Approach and Solutions

The governance framework consists of three phases: pre‑emptive, in‑process, and post‑incident.

1. Standards and Their Roles

Define operational standards, e.g., every table must have a primary key; disallow custom character set or collation attributes in DDL.

Quantify controls: limit slow‑SQL scan‑to‑result ratios (e.g., ≤100:1), cap transaction updates to 100,000 rows.

Prevent known bugs, such as large‑table TRUNCATE causing hangs or REPLACE INTO leading to master‑slave metadata inconsistency.

2. Concrete Practices

Table‑structure audit enforces primary‑key creation, InnoDB usage, and prevents accidental MyISAM tables. The audit system also generates DDL for new tables and handles schema changes automatically.

Code audit focuses on MyBatis‑based development, ensuring configuration consistency and scanning for risky patterns.

Health checks target slow SQL. A dedicated view records execution count, total time, and rows scanned. By sampling the view at two timestamps and computing differences, the team identifies statements with excessive execution time or row scans, then optimizes them (often by adding indexes).

3. Pre‑emptive Measures

Table‑structure and code reviews are automated through the audit system. Version control ensures consistent schema evolution.

Automated monitoring captures performance metrics; thresholds trigger automatic thread killing for runaway queries, separating online and batch workloads to avoid interference.

Large transactions (over 100,000 rows) are flagged because they monopolize the single‑threaded binlog writer, causing queueing and potential HA switchover delays. The platform automatically kills oversized transactions and gradually pilots the auto‑kill feature through staged monitoring and alerting.

Data recovery strategies include full backups combined with incremental binlog replay. For faster recovery, the team uses slave‑mode replay or industry‑standard tools that convert DML (e.g., DELETE → INSERT) and employ filesystem‑level restoration.

4. In‑Process Incident Handling

When a performance issue arises, the system automatically captures the offending thread ID via PROCESSLIST and terminates it if it exceeds predefined limits. Online and batch workloads are processed separately to respect their differing latency requirements.

Automatic killing is introduced cautiously: initial monitoring, followed by staged auto‑kill deployment.

5. Post‑Incident Diagnosis

Comprehensive data collection is essential. High‑density sampling records backend thread states for short‑term spikes, while low‑density sampling (every 15 minutes) tracks slow‑SQL metrics. These datasets support root‑cause analysis after incidents.

Future Improvement Ideas

Problem Location: Enhance pre‑production testing and real‑time detection to pinpoint issues faster.

Problem Prediction: Use performance trends and SQL execution patterns to forecast potential bottlenecks before they affect business.

Self‑Healing: Implement intelligent automation that automatically adds missing indexes or rewrites inefficient queries when performance degradation is detected.

Overall, the presentation shares practical insights into scaling MySQL for core banking workloads, emphasizing standards, automation, monitoring, and continuous improvement.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Automation performance monitoring MySQL Database Governance Large-Scale Deployment

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.