Databases 14 min read

Mastering High‑Performance Database Operations: Insights from Meituan’s Scale

This article shares Meituan’s practical strategies for efficient database operations, covering rapid business growth handling, automated DDL workflows, systematic training, capacity planning, backup automation with xtrabackup, and real‑world Q&A on monitoring, replication, and disaster recovery.

Efficient Ops
Efficient Ops
Efficient Ops
Mastering High‑Performance Database Operations: Insights from Meituan’s Scale

Editor

Yan Lihui @ SAP (content collection, article editing)

Dong Wei, Xiao Tianguo (review, publishing)

Guest Introduction

Hou Junwei joined Meituan in April 2013, serving as the database technology expert and lead for MySQL and Redis operations, with extensive experience in both relational and NoSQL databases.

Key Insights

Start database operations training with real‑case studies from production incidents.

Rapid business growth demands accurate capacity forecasting.

Automate DDL (create, alter, drop) within the operations system.

Use xtrabackup for backups and record metadata for seamless scaling.

Ensure master and slave hardware, parameters, and permissions are identical to avoid operational risks.

Topic Overview

Database operations remain critical in modern internet services. With Meituan’s explosive growth—half‑year revenue of 47 billion RMB and 130 million active buyers in 2014—efficient DB ops have become a new challenge.

This discussion focuses on three aspects: handling rapid business growth, building an automated database operations system, and implementing backup and fast‑recovery mechanisms.

How to Handle Rapid Business Growth?

The fast‑growing business exhibits three traits: rapid personnel expansion, fast‑increasing workload, and frequent business changes.

Personnel Expansion Fast

New hires often have limited experience, leading to many incidents. We address this with systematic training covering SQL tuning, MySQL internals, case‑study learning, and the automation platform.

Case studies use real incidents—problematic SQL, aggressive updates causing slave lag, large transactions holding MDL locks—to illustrate pitfalls.

Training materials are also organized in a wiki for continuous knowledge sharing.

Training diagram
Training diagram
Wiki screenshot
Wiki screenshot

Business Growth Fast

We perform monthly (or even weekly for very fast services) capacity planning to ensure the database can handle the steep growth curve.

We collect metrics (CPU, disk, network, I/O, read/write QPS) and use the 10‑minute max average as the current load baseline.

Benchmarking (generic and customized via tcpcopy) determines the maximum supported load. For example, simple PK queries on MySQL can reach 40 k QPS, but generic benchmarks would be overly conservative.

Business Change Fast

Frequent schema changes require automated DDL handling.

We integrate DDL (create, alter, drop) into the automation system, ensuring safety and speed.

During create, we enforce naming conventions, charset, storage engine, and comments.

For alter, we use pt-online-schema-change to avoid table locks and slave lag.

When drop is needed, we first rename the table, back it up, then delete, providing a rollback window.

Database Automation Operations System

The system includes self‑service DDL, self‑service permission granting, information query, and reporting.

Self‑service permission granting lets developers request access; the system generates a random password and routes the request to DBA for approval.

The information query module extracts data from information_schema (with innodb_stats_on_metadata=OFF) so developers can view table structures, row counts, data size, index size, sort by largest tables, and search across schemas.

Information query UI
Information query UI

Backup and Fast Recovery Mechanism

Backups are performed with xtrabackup; after each backup we record metadata (when, where, status, source) to enable automated scaling.

We use full backups with tiered policies based on business criticality, applying apply‑log after each backup. The latest backup remains uncompressed for instant MySQL startup; older backups are compressed with lz4.

Backup workflow diagram
Backup workflow diagram

Restoration runs directly on the backup storage server by launching MySQL with the backup data.

If the database is too large and backup duration encroaches on peak hours, we may dedicate a slave for backup, though this consumes extra resources. Limiting database size simplifies operations.

Q&A Highlights

Q1: Where are backups stored?

A1: Currently on a RAID‑5 large‑disk server; later we may move older archives to Swift.

Q2: How many slaves per production master? Are master and slave hardware identical?

A2: Slave count varies; all masters and slaves share identical hardware, parameters, and permissions to avoid inconsistencies.

Q3: Can backup keep up with many databases? Any off‑site backup?

A3: Multiple backup machines run in parallel, but occasional bottlenecks remain; off‑site backup is still being optimized.

Q4: Monitoring system and data sharding?

A4: We use Zabbix now, planning to add other open‑source monitors. Sharding is done via MySQL partitioning, application‑level sharding, and a new proxy.

Q5: What does tcpcopy‑based custom load testing involve?

A5: It amplifies live traffic to observe replication thread behavior under increased load.

Q6: Frequency of disaster‑recovery drills?

A6: Weekly data‑restore exercises based on backup data; full disaster drills are not yet scheduled.

Q7: Do you use a domain name for the DB service?

A7: No domain name is used.

Q8: After a master‑slave failover, is the original master restored to primary?

A8: No; we keep configurations identical and minimize manual operations to reduce risk.

Q9: How is master‑slave inconsistency handled during switchover?

A9: We are developing GTID‑based automatic switchover; currently it is manual, and if a slave lags too far we rebuild it.

Q10: Does MySQL support synchronous replication?

A10: Yes, semi‑synchronous replication is available; for stronger consistency you can use Percona XtraDB Cluster, though it adds latency.

Q11: When are sharding and capacity planning performed? What is typical replication lag?

A11: Sharding is planned ahead when possible; capacity assessments trigger additional planning. Core services have near‑zero lag with SMS alerts; some analytics workloads tolerate larger lag.
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Opsmysqlcapacity planningBackupDatabase operations
Efficient Ops
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.