How Meituan’s Database Capacity Evaluation System Boosts Stability and Efficiency
Meituan’s database team introduced a capacity evaluation system that uses online traffic replay in sandbox environments, accelerated replay to uncover performance bottlenecks, and a capacity‑operation loop to monitor and manage cluster resources, thereby improving database stability, safety of changes, and resource utilization.
01 Project Background
Databases are the core foundation of business systems, and their stability is critical. As business scale expands, the demand for database stability reaches unprecedented levels. The Meituan DBA team faces common pain points:
Pain point 1: During peak events, it is difficult to accurately evaluate read/write capacity limits, leading to capacity shortfalls and production incidents.
Typical capacity assessment methods include metric calculation and full‑link pressure testing.
Metric calculation compares load metrics against thresholds, but the relationship between traffic and metrics is not strictly linear, which reduces prediction accuracy.
Full‑link pressure testing records upstream traffic and replays it to surface bottlenecks, yet traffic realism depends on scenario complexity and sample richness, and onboarding services into the testing framework adds cost.
Pain point 2: Database changes are a major source of online incidents, and risky changes are hard to identify.
Common solutions are fixed‑rule interception and offline testing.
Fixed‑rule interception integrates rule sets into the change platform, but routine changes such as adding indexes or modifying column types often bypass interception, potentially degrading performance.
Offline testing runs changes in a test environment, yet data differences and insufficient testing can miss risks.
02 Project Goals
Build a database capacity evaluation system that replays real online traffic in a sandbox, establishing a complete capacity assessment framework to provide scientific data support for capacity planning.
1. Data operation safety: Replay and evaluation must not affect the normal operation of online clusters.
2. Evaluation result authenticity: Use fully realistic traffic and environments to ensure accurate read/write capacity assessment.
3. System flexibility and efficiency: Provide high‑efficiency automation and plug‑in architecture for rapid integration into various systems.
03 Capability Panorama
Testing platform: Provides traffic replay, capacity exploration, capacity operation, and budget calibration interfaces.
External empowerment: OpenAPI allows external services to invoke replay capabilities.
Core functions: Traffic replay, capacity exploration, and capacity operation form a full‑link capacity management capability.
Basic capabilities: A modular architecture decouples functions, making the system easier to operate and extend.
Replay types: Supports any SQL‑protocol target after appropriate adaptation.
04 Traffic Replay
The system records online traffic, replays it in an isolated sandbox, and evaluates cluster performance, providing a safe environment for change verification.
4.1 Business Scenario
4.2 Architecture Design
The replay pipeline consists of the following core processes:
Traffic Collect: Captures full SQL text, transaction ID, and execution time via the custom MTSQL kernel, sending data to Kafka.
Data Clean: Flink consumes the Kafka stream, filtering SQL for registered replay clusters.
Data Store: Structured SQL is written to ClickHouse for efficient large‑scale processing.
Traffic Process: Aggregates, processes, and persists traffic files for replay.
Traffic Replay: Replay‑agent consumes traffic files from S3, controls replay rhythm, and streams SQL to the sandbox cluster.
Data Analyze & Report: Collects CPU, load, slow query, replication delay, SQL execution time, and MySQL parameters to provide comprehensive observability.
4.3 Effect Demonstration
The replay report highlights the SQL statements whose performance changed most, confirming that enabling table compression improved storage efficiency without noticeable latency impact.
05 Capacity Exploration
To determine the maximum QPS a cluster can sustain, the system replays traffic at accelerated speeds in a sandbox.
5.1 Business Scenario
5.2 Process Implementation
The system captures real peak‑period traffic, then runs multiple accelerated replay rounds, gradually increasing load until the cluster’s alert threshold is triggered.
Fast Exploration Phase: Exponential speed increase quickly locates the alert boundary.
Precise Calibration Phase: Binary search between the last non‑alert and first alert speeds refines the maximum sustainable QPS.
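The two phases above can be sketched as a simple search loop. This is a minimal illustration, not Meituan's implementation: `triggersAlert` stands in for the real metric checks (CPU, load, latency) performed after each replay round, and the function and variable names are assumptions.

```go
package main

import "fmt"

// triggersAlert reports whether replaying at the given speed multiple
// pushes the sandbox cluster past its alert threshold. Here it is a
// stand-in predicate; the real system inspects metrics collected
// during the replay round.
func triggersAlert(speed, maxSustainable float64) bool {
	return speed > maxSustainable
}

// exploreMaxSpeed finds the highest replay speed that does not trigger
// alerts: an exponential ramp quickly locates the alert boundary, then
// a binary search between the last safe speed and the first alerting
// speed refines the result to the given precision.
func exploreMaxSpeed(maxSustainable, precision float64) float64 {
	lo, hi := 1.0, 1.0
	// Fast exploration phase: double the speed until an alert fires.
	for !triggersAlert(hi, maxSustainable) {
		lo = hi
		hi *= 2
	}
	// Precise calibration phase: binary search inside (lo, hi].
	for hi-lo > precision {
		mid := (lo + hi) / 2
		if triggersAlert(mid, maxSustainable) {
			hi = mid
		} else {
			lo = mid
		}
	}
	return lo
}

func main() {
	// Example: a cluster that can actually sustain ~6.3x traffic.
	max := exploreMaxSpeed(6.3, 0.1)
	fmt.Printf("max sustainable replay speed: %.2fx\n", max)
}
```

The exponential ramp keeps the number of expensive replay rounds logarithmic in the final speed, and the binary search bounds the error by the chosen precision.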
5.3 Accelerated Replay
Replay speed is increased by compressing SQL execution intervals, raising request density.
Within the replay agent, the CalcSendTime function adjusts the original time offset (originalDelta) by the replay speed to compute a planned offset (planDelta).
If planDelta ≥ realPassDelta, the coroutine waits; otherwise it executes immediately, ensuring accurate timing control.
5.4 Effect Demonstration
The report shows that after reducing machine specs, the cluster only triggered alerts at >6× traffic, comfortably meeting the 2× capacity requirement, leading to a successful down‑size.
06 Capacity Operation
To give DBAs a real‑time view of cluster capacity, a capacity‑operation service was built.
6.1 Business Scenario
6.2 Operation Design
The service integrates three core modules:
Evaluation Hosting: Registers stable‑traffic clusters for periodic capacity exploration and retains results under an expiration policy.
Capacity Calculation: Computes usage water‑level (online QPS / evaluated max QPS) and provides scaling suggestions.
Automated Operations: One‑click workflow applies capacity recommendations and includes traceable, observable, rollback‑capable change pipelines.
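The water‑level computation in the Capacity Calculation module can be sketched as below. The ratio (online QPS / evaluated max QPS) is from the article; the 50%/10% decision thresholds are illustrative assumptions, not Meituan's actual scaling policy.

```go
package main

import "fmt"

// WaterLevel computes cluster capacity usage as the ratio of current
// online QPS to the maximum QPS measured by capacity exploration.
func WaterLevel(onlineQPS, evaluatedMaxQPS float64) float64 {
	return onlineQPS / evaluatedMaxQPS
}

// Suggest maps a water-level to a scaling recommendation. The
// thresholds here are placeholders: above 50% usage the cluster no
// longer has 2x headroom for a traffic spike; below 10% it is likely
// over-provisioned.
func Suggest(level float64) string {
	switch {
	case level >= 0.5:
		return "scale up: less than 2x headroom remains"
	case level <= 0.1:
		return "scale down: cluster is over-provisioned"
	default:
		return "hold: capacity is adequate"
	}
}

func main() {
	level := WaterLevel(30000, 100000)
	fmt.Printf("water-level %.0f%%: %s\n", level*100, Suggest(level))
}
```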
6.3 Effect Demonstration
The operation dashboard displays read/write water‑levels and suggestions; clusters with risk were automatically expanded, simplifying the evaluation workflow and greatly improving efficiency.
07 Future Plans
Support more database types such as MGR, Proxy, Blade, and Elasticsearch.
Integrate with budgeting systems to expose resource redundancy and reduce over‑budgeting.
Introduce intelligent capacity tuning that automatically analyzes bottlenecks, generates optimization plans, and validates them.
Build a case library preserving snapshots, traffic, and configurations for chaos engineering and long‑term MySQL stability.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
