Databases 17 min read

How ICBC Scales MySQL to 10,000 Nodes: Governance, Challenges, and Solutions

This article summarizes ICBC's experience of deploying nearly ten thousand MySQL nodes, detailing the current status, challenges, governance strategies—including standards, containerization, automated monitoring, emergency handling, and post‑incident analysis—and outlines future improvement plans for core‑application databases.

21CTO

Jan 8, 2021

How ICBC Scales MySQL to 10,000 Nodes: Governance, Challenges, and Solutions

Abstract

This article is based on a 2020 DTCC Database Conference presentation. ICBC began promoting MySQL in 2014 and now operates nearly ten thousand MySQL nodes in production, extending from low‑level peripheral applications to core high‑level services. It shares the bank’s governance ideas and solutions for handling core‑business data storage.

Speaker Introduction

Lin Zhenxi, a junior engineer at ICBC, started working with MySQL in 2014, introducing MySQL 5.5 and 5.7 to the bank. He is responsible for MySQL architecture design, development standards, and technical training.

Current Status and Challenges

1.1 Current Status

Since June 2019, the number of MySQL nodes has exploded, increasing severalfold in two years. A large proportion of these nodes serve core applications (A‑class), driven by the bank’s policy of using MySQL for new databases and the high demand for high‑capacity, high‑concurrency, elastic services.

Core applications are classified as A‑class, the highest level.

1.2 Challenges

The bank faces massive high‑capacity, high‑concurrency workloads. MySQL is part of a broader distributed system that includes services, soft load balancing, distributed transactions, messaging, batch processing, caching, object storage, and file storage—nine supporting platforms in total. Managing such scale puts pressure on monitoring, alerting, and fault recovery; manual handling is infeasible, so automation and intelligent tools are required.

Traditional physical‑machine deployment wastes CPU resources because MySQL typically uses low CPU. To address this, over 90% of MySQL instances have been containerized, achieving 4‑5× higher CPU utilization.

Governance Ideas and Solutions

The governance work is divided into three phases: pre‑incident prevention, incident response, and post‑incident diagnosis.

2.1 Role of Standards

Standards define operational rules (e.g., every table must have a primary key, no charset or collation in DDL, default instance attributes). They also provide quantitative controls such as limiting rows scanned vs. rows returned (e.g., slow‑SQL scan‑to‑result ratio ≤ 100:1) and capping transaction updates to 100,000 rows. Standards help avoid bugs (e.g., Truncate on large tables causing hangs, Replace‑into causing master‑slave metadata inconsistency).

2.2 Specific Standard Features

Each rule includes an explanatory note to ensure developers understand the rationale. The standards are enforced through automated tools for table‑structure review, version control, and code review (MyBatis‑based). Table‑structure review generates DDL automatically, preventing manual errors such as accidental MyISAM usage.

2.3 Three Main Workstreams

Pre‑incident Prevention : Table‑structure audits, code reviews, and automated checks to catch issues early.

Incident Response : Automated monitoring captures performance anomalies (e.g., slow SQL). When thresholds are exceeded, the system automatically kills offending threads using process‑list information. Separate handling for online and batch workloads ensures short‑lived online transactions are not impacted by batch‑type long‑running SQL.

Large transactions (over 100,000 rows) are identified and automatically terminated because they block the single‑threaded binlog writer on the master, causing cascading delays.

Post‑incident Diagnosis : Comprehensive data collection (high‑density thread metrics for short‑term spikes, low‑density slow‑SQL metrics every 15 minutes) enables root‑cause analysis. Both high‑frequency and periodic data aid in understanding performance fluctuations.

Future Improvement Directions

1. Problem Localization: Enhance detection in development, testing, and production environments to quickly pinpoint issues.

2. Problem Prediction: Use performance trends and SQL execution statistics to forecast potential bottlenecks before they affect business.

3. Self‑Healing: Leverage intelligent automation to automatically remediate problems, such as adding missing indexes when full‑table scans are detected.

Q&A Highlights

• ICBC uses the open‑source MySQL 5.7 community edition with a primary‑multiple‑replica, two‑site‑three‑center architecture. • Over 90% of MySQL instances run in Kubernetes containers on local SSD storage; the remaining ~10% run on physical machines. • The bank developed its own high‑availability platform in collaboration with a third‑party vendor, rather than using MHA.

Source: ITPub Community

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

automation Performance Monitoring containerization mysql Database Governance ICBC Large‑Scale Deployment

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.