
How Meituan’s Database Capacity Evaluation System Boosts Stability and Efficiency

Meituan’s database team introduced a capacity evaluation system that uses online traffic replay in sandbox environments, accelerated replay to uncover performance bottlenecks, and a capacity‑operation loop to monitor and manage cluster resources, thereby improving database stability, safety of changes, and resource utilization.

Meituan Technology Team

01 Project Background

Databases are the core foundation of business systems, and their stability is critical. As business scale expands, the demand for database stability reaches unprecedented levels. The Meituan DBA team faces common pain points:

Pain point 1: During peak events, it is difficult to accurately evaluate read/write capacity limits, leading to capacity shortage and production incidents.

Typical capacity assessment methods include metric calculation and full‑link pressure testing.

Metric calculation compares load metrics against thresholds, but the relationship between traffic and load metrics is not strictly linear, which reduces prediction accuracy.

Full‑link pressure testing records upstream traffic and replays it to detect bottlenecks, yet traffic realism can be affected by scenario complexity and sample richness, and integrating services into testing incurs additional cost.

Pain point 2: Database changes are a major source of online incidents, and risky changes are hard to identify.

Common solutions are fixed‑rule interception and offline testing.

Fixed‑rule interception integrates rule sets into the change platform, but routine changes such as adding indexes or modifying column types often bypass interception, potentially degrading performance.

Offline testing runs changes in a test environment, yet data differences and insufficient testing can miss risks.

02 Project Goals

Build a database capacity evaluation system that replays real online traffic in a sandbox, establishing a complete capacity assessment framework to provide scientific data support for capacity planning.

1. Data operation safety: Replay and evaluation must not affect the normal operation of online clusters.

2. Evaluation result authenticity: Use fully realistic traffic and environments to ensure accurate read/write capacity assessment.

3. System flexibility and efficiency: Provide high‑efficiency automation and plug‑in architecture for rapid integration into various systems.

03 Capability Panorama

Testing platform: Provides traffic replay, capacity exploration, capacity operation, and budget calibration interfaces.

External empowerment: OpenAPI allows external services to invoke replay capabilities.

Core functions: Traffic replay, capacity exploration, and capacity operation form a full‑link capacity management capability.

Basic capabilities: Modular architecture decouples functions, enhancing operability.

Replay types: Supports any SQL‑protocol target after appropriate adaptation.

04 Traffic Replay

The system records online traffic, replays it in an isolated sandbox, and evaluates cluster performance, providing a safe environment for change verification.

4.1 Business Scenario

4.2 Architecture Design

The replay pipeline consists of six core processes:

Traffic Collect: Captures full SQL text, transaction ID, and execution time via the custom MTSQL kernel, sending data to Kafka.

Data Clean: Flink consumes the Kafka stream, filtering SQL for registered replay clusters.

Data Store: Structured SQL is written to ClickHouse for efficient large‑scale processing.

Traffic Process: Aggregates, processes, and persists traffic files for replay.

Traffic Replay: Replay‑agent consumes traffic files from S3, controls replay rhythm, and streams SQL to the sandbox cluster.

Data Analyze & Report: Collects CPU, load, slow query, replication delay, SQL execution time, and MySQL parameters to provide comprehensive observability.
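As an illustration of the Data Clean step, the filtering that Flink performs can be sketched as follows. This is a minimal Python sketch, not the production Flink job; the record fields mirror those listed above, and the cluster names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SQLRecord:
    cluster: str        # source cluster of the captured statement
    sql_text: str       # full SQL text captured by the MTSQL kernel
    txn_id: int         # transaction ID
    executed_at: float  # original execution time (epoch seconds)

# Clusters registered for replay (hypothetical names).
REGISTERED = {"order-db", "pay-db"}

def keep(record: SQLRecord) -> bool:
    """Data Clean: retain only traffic for registered replay clusters."""
    return record.cluster in REGISTERED
```

Records that pass the filter are then structured and written to ClickHouse, as described in the Data Store step.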

4.3 Effect Demonstration

The replay report shows top‑changing SQL statements and performance improvements, confirming that enabling table compression increased storage efficiency without noticeable latency impact.

05 Capacity Exploration

To determine the maximum QPS a cluster can sustain, the system replays traffic at accelerated speeds in a sandbox.

5.1 Business Scenario

5.2 Process Implementation

The system captures real peak‑period traffic, then runs multiple accelerated replay rounds, gradually increasing load until the cluster’s alert threshold is triggered.

Fast Exploration Phase: Exponential speed increase quickly locates the alert boundary.

Precise Calibration Phase: Binary search between the last non‑alert and first alert speeds refines the maximum sustainable QPS.
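The two phases above can be sketched as a single search routine. This is a minimal sketch under stated assumptions: `triggers_alert` stands in for one accelerated replay round at a given speed, and the speed cap and tolerance are illustrative parameters, not values from the article:

```python
def find_max_speed(triggers_alert, start=1.0, tolerance=0.05, max_speed=64.0):
    """Return the highest replay speed that does not trigger a cluster alert.

    triggers_alert(speed) -> bool runs one accelerated replay round.
    """
    # Fast Exploration Phase: double the speed until an alert fires.
    speed = start
    last_ok = None
    while speed <= max_speed and not triggers_alert(speed):
        last_ok = speed
        speed *= 2
    if last_ok is None:
        return start          # alerts even at the starting speed
    if speed > max_speed:
        return last_ok        # never alerted within the cap

    # Precise Calibration Phase: binary search between the last
    # non-alert speed and the first alert speed.
    lo, hi = last_ok, speed
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if triggers_alert(mid):
            hi = mid
        else:
            lo = mid
    return lo
```

Multiplying the resulting speed by the recorded peak QPS gives the cluster's estimated maximum sustainable QPS.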

5.3 Accelerated Replay

Replay speed is increased by compressing SQL execution intervals, raising request density.

Within the replay agent, the CalcSendTime function divides the original time offset (originalDelta) by the replay speed to compute a planned offset (planDelta).

If planDelta exceeds the real elapsed time (realPassDelta), the coroutine waits for the difference; otherwise it executes immediately, ensuring accurate timing control.
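The timing decision can be sketched as follows. This is a Python sketch of the logic only, not the agent's actual implementation; the parameter names follow the offsets described above:

```python
def calc_send_time(original_delta: float, speed: float,
                   real_pass_delta: float) -> float:
    """Seconds the replay coroutine should wait before sending the next SQL.

    original_delta:  offset of this statement from the recording start
    speed:           acceleration factor (2.0 = replay twice as fast)
    real_pass_delta: wall-clock seconds elapsed since the replay started
    """
    plan_delta = original_delta / speed   # compress the recorded interval
    wait = plan_delta - real_pass_delta
    return max(wait, 0.0)                 # already behind schedule: send now
```

For example, a statement recorded 10 s into the capture, replayed at 2x with 3 s already elapsed, is scheduled 2 s later; if 6 s have elapsed it is sent immediately.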

5.4 Effect Demonstration

The report shows that after reducing machine specs, the cluster only triggered alerts at >6× traffic, comfortably meeting the 2× capacity requirement, leading to a successful down‑size.

06 Capacity Operation

To give DBAs a real‑time view of cluster capacity, a capacity‑operation service was built.

6.1 Business Scenario

6.2 Operation Design

The service integrates three core modules:

Evaluation Hosting: Registers stable‑traffic clusters for periodic capacity exploration and retains results with expiration.

Capacity Calculation: Computes usage water‑level (online QPS / evaluated max QPS) and provides scaling suggestions.

Automated Operations: One‑click workflow applies capacity recommendations and includes traceable, observable, rollback‑capable change pipelines.
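The Capacity Calculation module's water-level formula can be sketched as follows. The formula matches the one stated above; the scaling thresholds are illustrative assumptions, not Meituan's actual policy:

```python
def water_level(online_qps: float, evaluated_max_qps: float) -> float:
    """Usage water-level: online QPS divided by the evaluated maximum QPS."""
    return online_qps / evaluated_max_qps

def scaling_suggestion(level: float, high: float = 0.8, low: float = 0.2) -> str:
    """Map a water-level to a suggestion (thresholds are hypothetical)."""
    if level >= high:
        return "scale up"     # little headroom before the evaluated limit
    if level <= low:
        return "scale down"   # cluster is heavily over-provisioned
    return "hold"
```

A cluster serving 9,000 QPS against an evaluated maximum of 10,000 QPS sits at a 0.9 water-level and would be flagged for expansion.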

6.3 Effect Demonstration

The operation dashboard displays read/write water‑levels and suggestions; clusters with risk were automatically expanded, simplifying the evaluation workflow and greatly improving efficiency.

07 Future Plans

Support more database types such as MGR, Proxy, Blade, and Elasticsearch.

Integrate with budgeting systems to expose resource redundancy and reduce over‑budgeting.

Introduce intelligent capacity tuning that automatically analyzes bottlenecks, generates optimization plans, and validates them.

Build a case library preserving snapshots, traffic, and configurations for chaos engineering and long‑term MySQL stability.

Written by Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.