
How Meituan’s Database Capacity Assessment System Boosts Stability and Efficiency

Meituan’s Database Capacity Assessment System uses online traffic replay, accelerated load testing, and automated analysis to safely evaluate and optimize database read/write capacity, providing real‑time operation insights, flexible scaling, and reliable change‑risk mitigation for large‑scale production environments.

dbaplus Community

Project Background

Databases are the core foundation of business systems, and as Meituan’s services scale, the demand for stable database performance has become critical. The team faced challenges such as inaccurate capacity estimation during peak events and high‑risk changes that could cause production incidents.

Project Goals

Data operation safety: Ensure replay and evaluation do not affect the normal operation of online clusters.

Realistic evaluation results: Use fully simulated traffic and environments to accurately measure read/write capacity.

Flexible and efficient system: Provide automated, plugin‑based capabilities that can be quickly integrated into various systems.

Capability Overview

Traffic replay platform: Provides a UI for replay, capacity exploration, operation, and budget calibration.

External empowerment: An OpenAPI allows external services to invoke replay capabilities.

Core functions: Traffic replay, capacity exploration, and capacity operation form an end‑to‑end capacity management chain.

Modular foundation: Decoupled modules improve the operability of traffic replay.

Replay type support: Any SQL‑compatible database can be integrated with minimal adaptation.

Traffic Replay

The system records online traffic, replays it in an isolated sandbox, and evaluates cluster performance, providing a safe environment for change verification.

Traffic Collection

Data collection: The in‑house MTSQL kernel captures full SQL text, transaction ID, and execution time, sending data via rds‑agent to Kafka.

Data cleaning: Flink processes the Kafka stream, filtering SQL for registered replay clusters.

Data storage: Structured SQL data is written to ClickHouse for efficient large‑scale handling.
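The collection pipeline's cleaning stage boils down to filtering the raw SQL event stream to clusters registered for replay and projecting the fields later stages need. A minimal Python sketch of that filtering logic, with illustrative field and cluster names (the real job runs in Flink against Kafka, not shown here):

```python
import json

# Hypothetical set of clusters registered for replay; names are illustrative.
REGISTERED_CLUSTERS = {"order-db", "pay-db"}

def clean_stream(raw_records):
    """Keep only SQL events for registered replay clusters and project the
    structured fields (cluster, SQL text, transaction ID, execution time)
    that downstream storage and replay consume."""
    for raw in raw_records:
        event = json.loads(raw)
        if event.get("cluster") not in REGISTERED_CLUSTERS:
            continue  # drop traffic for clusters not under assessment
        yield {
            "cluster": event["cluster"],
            "sql": event["sql"],
            "txn_id": event["txn_id"],
            "exec_time_us": event["exec_time_us"],
        }

records = [
    json.dumps({"cluster": "order-db", "sql": "SELECT 1", "txn_id": 7, "exec_time_us": 120}),
    json.dumps({"cluster": "ads-db", "sql": "SELECT 2", "txn_id": 8, "exec_time_us": 90}),
]
kept = list(clean_stream(records))  # only the order-db event survives
```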

Traffic Processing

Data aggregation: sql‑log agent scans ClickHouse and aggregates SQL per primary/replica role.

Data processing: Primary role aggregates transactions; replica role keeps SQL independent to simulate high‑concurrency reads.

Data persistence: Processed data is stored as traffic files in S3 for replay consumption.
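The role-dependent aggregation above can be sketched in a few lines of Python. The event shape and function names are illustrative, not Meituan's actual schema; the point is that primary traffic is grouped by transaction to preserve ordering, while replica traffic stays statement-independent so it can be replayed concurrently:

```python
from collections import defaultdict

def process_traffic(events, role):
    """Aggregate SQL into replay units per role before writing traffic files.
    Primary: group statements by transaction ID so replay preserves
    transactional ordering. Replica: one unit per statement so reads
    can be replayed at high concurrency."""
    if role == "primary":
        txns = defaultdict(list)
        for e in events:
            txns[e["txn_id"]].append(e["sql"])
        return list(txns.values())          # one replay unit per transaction
    return [[e["sql"]] for e in events]     # one replay unit per statement

events = [
    {"txn_id": 1, "sql": "UPDATE t SET a=1"},
    {"txn_id": 1, "sql": "COMMIT"},
    {"txn_id": 2, "sql": "SELECT * FROM t"},
]
```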

Traffic Replay

The replay‑agent runs as a Kubernetes job, consuming traffic files from S3, controlling replay speed, and executing SQL in the sandbox.

Data Analysis & Reporting

Collect runtime metrics (CPU, load, slow queries, replication lag).

Gather SQL execution times.

Capture MySQL configuration and instance details.

Capacity assessment system overview

Capacity Exploration

To determine the maximum QPS a cluster can sustain, the system replays traffic at accelerated speeds in a sandbox, gradually increasing load until alerts trigger.

Fast exploration phase: An exponential speed increase quickly identifies the approximate capacity boundary.

Precise calibration phase: A binary search between the last safe speed and the first alert‑triggering speed refines the exact limit.
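The two phases combine into a simple search procedure. A hedged sketch, where `triggers_alert(speed)` stands in for a hypothetical probe that replays traffic at the given speed multiplier and reports whether any alert fired:

```python
def explore_capacity(triggers_alert, start_speed=1.0, tolerance=0.05):
    """Two-phase search for the maximum safe replay speed.
    Phase 1 (fast exploration): double the speed until an alert triggers.
    Phase 2 (precise calibration): binary-search between the last safe
    speed and the first alerting speed until the gap is within tolerance."""
    speed = start_speed
    last_safe = 0.0
    while not triggers_alert(speed):
        last_safe = speed
        speed *= 2                      # exponential ramp
    lo, hi = last_safe, speed           # lo is safe, hi triggered an alert
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if triggers_alert(mid):
            hi = mid
        else:
            lo = mid
    return lo                           # highest speed observed to be safe

# toy probe: pretend the cluster alerts above 6.4x replay speed
limit = explore_capacity(lambda s: s > 6.4)
```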

Accelerated replay compresses SQL execution intervals, raising request density. The replay agent schedules each SQL statement with a coroutine, calculating a planned execution time (planDelta) based on the desired speed and comparing it to the real elapsed time (realPassDelta) to decide whether to wait or execute immediately.
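The planDelta/realPassDelta comparison can be sketched as a pacing loop. This is a simplified single-threaded version using `time.sleep` rather than coroutines; the statement format and function names are illustrative:

```python
import time

def replay(statements, speed, execute):
    """Pace SQL replay at `speed` times the recorded rate.
    Each statement's planned offset (planDelta) is its recorded offset
    divided by the speed factor; if real elapsed time (realPassDelta)
    lags the plan, sleep until the slot arrives, otherwise fire at once.
    `statements` is a list of (recorded_offset_seconds, sql) pairs."""
    start = time.monotonic()
    for recorded_offset, sql in statements:
        plan_delta = recorded_offset / speed              # compressed schedule
        real_pass_delta = time.monotonic() - start
        if plan_delta > real_pass_delta:
            time.sleep(plan_delta - real_pass_delta)      # ahead of schedule: wait
        execute(sql)                                      # behind schedule: execute now

executed = []
replay([(0.0, "SELECT 1"), (0.2, "SELECT 2")], speed=10.0, execute=executed.append)
```

At `speed=10.0` the 0.2 s recorded gap shrinks to 0.02 s, which is exactly how acceleration raises request density.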

SQL replay speed control diagram

Capacity Operation

The operation module provides real‑time visibility of cluster capacity and automated recommendations.

Evaluation hosting: Core clusters with stable traffic models are registered for periodic capacity exploration.

Capacity calculation: Computes the usage water‑level (online QPS / explored max QPS) and suggests scaling actions.

Automated ops: One‑click workflows trigger capacity‑adjustment tasks, with traceable, observable, and rollback‑able change pipelines.
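The water-level calculation is a single ratio; the scaling thresholds below are illustrative placeholders, not Meituan's actual policy:

```python
def water_level(online_qps, max_qps):
    """Usage water-level = online QPS / explored maximum QPS."""
    return online_qps / max_qps

def suggest_action(level, scale_up_at=0.8, scale_down_at=0.3):
    """Map a water-level to a scaling suggestion.
    Thresholds are hypothetical examples."""
    if level >= scale_up_at:
        return "scale up"
    if level <= scale_down_at:
        return "scale down"
    return "hold"

# e.g. 12k online QPS against a 20k explored ceiling -> 0.6 -> hold
level = water_level(12000, 20000)
```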

Capacity operation dashboard

Future Plans

Support more database types (MGR, Proxy, Blade, Elasticsearch).

Integrate with budgeting systems to expose hidden resource redundancy.

Introduce intelligent capacity tuning that automatically analyzes bottlenecks and validates optimizations.

Build a case‑library of typical promotion and incident scenarios with data snapshots for chaos engineering.

Tags: Automation, traffic replay, capacity planning, Meituan
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
