Mastering High‑Performance Database Operations: Insights from Meituan’s Scale
This article shares Meituan’s practical strategies for efficient database operations, covering rapid business growth handling, automated DDL workflows, systematic training, capacity planning, backup automation with xtrabackup, and real‑world Q&A on monitoring, replication, and disaster recovery.
Editor
Yan Lihui @ SAP (content collection, article editing)
Dong Wei, Xiao Tianguo (review, publishing)
Guest Introduction
Hou Junwei joined Meituan in April 2013, serving as the database technology expert and lead for MySQL and Redis operations, with extensive experience in both relational and NoSQL databases.
Key Insights
Start database operations training with real‑case studies from production incidents.
Rapid business growth demands accurate capacity forecasting.
Automate DDL (create, alter, drop) within the operations system.
Use xtrabackup for backups and record metadata for seamless scaling.
Ensure master and slave hardware, parameters, and permissions are identical to avoid operational risks.
Topic Overview
Database operations remain critical in modern internet services. With Meituan’s explosive growth—half‑year revenue of 47 billion RMB and 130 million active buyers in 2014—efficient DB ops have become a new challenge.
This discussion focuses on three aspects: handling rapid business growth, building an automated database operations system, and implementing backup and fast‑recovery mechanisms.
How to Handle Rapid Business Growth?
The fast‑growing business exhibits three traits: rapid personnel expansion, fast‑increasing workload, and frequent business changes.
Personnel Expansion Fast
New hires often have limited experience, leading to many incidents. We address this with systematic training covering SQL tuning, MySQL internals, case‑study learning, and the automation platform.
Case studies use real incidents—problematic SQL, aggressive updates causing slave lag, large transactions holding MDL locks—to illustrate pitfalls.
Training materials are also organized in a wiki for continuous knowledge sharing.
Business Growth Fast
We perform monthly (or even weekly for very fast services) capacity planning to ensure the database can handle the steep growth curve.
We collect metrics (CPU, disk, network, I/O, read/write QPS) and use the 10‑minute max average as the current load baseline.
Benchmarking (generic and customized via tcpcopy) determines the maximum supported load. For example, simple PK queries on MySQL can reach 40 k QPS, but generic benchmarks would be overly conservative.
Business Change Fast
Frequent schema changes require automated DDL handling.
We integrate DDL (create, alter, drop) into the automation system, ensuring safety and speed.
During create, we enforce naming conventions, charset, storage engine, and comments.
For alter, we use pt-online-schema-change to avoid table locks and slave lag.
When drop is needed, we first rename the table, back it up, then delete, providing a rollback window.
Database Automation Operations System
The system includes self‑service DDL, self‑service permission granting, information query, and reporting.
Self‑service permission granting lets developers request access; the system generates a random password and routes the request to DBA for approval.
The information query module extracts data from information_schema (with innodb_stats_on_metadata=OFF) so developers can view table structures, row counts, data size, index size, sort by largest tables, and search across schemas.
Backup and Fast Recovery Mechanism
Backups are performed with xtrabackup; after each backup we record metadata (when, where, status, source) to enable automated scaling.
We use full backups with tiered policies based on business criticality, applying apply‑log after each backup. The latest backup remains uncompressed for instant MySQL startup; older backups are compressed with lz4.
Restoration runs directly on the backup storage server by launching MySQL with the backup data.
If the database is too large and backup duration encroaches on peak hours, we may dedicate a slave for backup, though this consumes extra resources. Limiting database size simplifies operations.
Q&A Highlights
Q1: Where are backups stored?
A1: Currently on a RAID‑5 large‑disk server; later we may move older archives to Swift.
Q2: How many slaves per production master? Are master and slave hardware identical?
A2: Slave count varies; all masters and slaves share identical hardware, parameters, and permissions to avoid inconsistencies.
Q3: Can backup keep up with many databases? Any off‑site backup?
A3: Multiple backup machines run in parallel, but occasional bottlenecks remain; off‑site backup is still being optimized.
Q4: Monitoring system and data sharding?
A4: We use Zabbix now, planning to add other open‑source monitors. Sharding is done via MySQL partitioning, application‑level sharding, and a new proxy.
Q5: What does tcpcopy‑based custom load testing involve?
A5: It amplifies live traffic to observe replication thread behavior under increased load.
Q6: Frequency of disaster‑recovery drills?
A6: Weekly data‑restore exercises based on backup data; full disaster drills are not yet scheduled.
Q7: Do you use a domain name for the DB service?
A7: No domain name is used.
Q8: After a master‑slave failover, is the original master restored to primary?
A8: No; we keep configurations identical and minimize manual operations to reduce risk.
Q9: How is master‑slave inconsistency handled during switchover?
A9: We are developing GTID‑based automatic switchover; currently it is manual, and if a slave lags too far we rebuild it.
Q10: Does MySQL support synchronous replication?
A10: Yes, semi‑synchronous replication is available; for stronger consistency you can use Percona XtraDB Cluster, though it adds latency.
Q11: When are sharding and capacity planning performed? What is typical replication lag?
A11: Sharding is planned ahead when possible; capacity assessments trigger additional planning. Core services have near‑zero lag with SMS alerts; some analytics workloads tolerate larger lag.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
