How Meituan’s Database Capacity Assessment System Boosts Stability and Efficiency
Meituan’s Database Capacity Assessment System uses online traffic replay, accelerated load testing, and automated analysis to safely evaluate and optimize database read/write capacity. It provides real‑time operational insight, flexible scaling, and reliable change‑risk mitigation for large‑scale production environments.
Project Background
Databases are the core foundation of business systems, and as Meituan’s services scale, the demand for stable database performance has become critical. The team faced challenges such as inaccurate capacity estimation during peak events and high‑risk changes that could cause production incidents.
Project Goals
Data operation safety: Ensure replay and evaluation do not affect the normal operation of online clusters.
Realistic evaluation results: Use fully simulated traffic and environments to accurately measure read/write capacity.
Flexible and efficient system: Provide automated, plugin‑based capabilities that can be quickly integrated into various systems.
Capability Overview
Traffic replay platform: Provides a UI for replay, capacity exploration, operation, and budget calibration.
External empowerment: An OpenAPI lets external services invoke replay capabilities.
Core functions: Traffic replay, capacity exploration, and capacity operation form an end‑to‑end capacity management chain.
Modular foundation: Decoupled modules improve the operability of traffic replay.
Replay type support: Any SQL‑compatible database can be integrated with minimal adaptation.
Traffic Replay
The system records online traffic, replays it in an isolated sandbox, and evaluates cluster performance, providing a safe environment for change verification.
Traffic Collect
Data collection: The in‑house MTSQL kernel captures full SQL text, transaction ID, and execution time, sending data via rds‑agent to Kafka.
Data cleaning: Flink processes the Kafka stream, filtering SQL for registered replay clusters.
Data storage: Structured SQL data is written to ClickHouse for efficient large‑scale handling.
Traffic Process
Data aggregation: sql‑log agent scans ClickHouse and aggregates SQL per primary/replica role.
Data processing: Primary role aggregates transactions; replica role keeps SQL independent to simulate high‑concurrency reads.
Data persistence: Processed data is stored as traffic files in S3 for replay consumption.
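The role-dependent processing can be sketched as follows: primary traffic is grouped into transactions so writes replay as units, while replica traffic stays statement-independent so reads can be issued concurrently. Field names are illustrative, not the actual traffic-file schema.

```python
from collections import defaultdict

def process(events, role: str):
    """Primary: group SQL into transactions, replayed as ordered units.
    Replica: keep each statement independent to simulate concurrent reads."""
    if role == "primary":
        txns = defaultdict(list)
        for ev in sorted(events, key=lambda e: e["exec_ts"]):
            txns[ev["txn_id"]].append(ev["sql"])
        return list(txns.values())
    return [[ev["sql"]] for ev in events]

events = [
    {"txn_id": 1, "sql": "UPDATE t SET a = 1", "exec_ts": 2},
    {"txn_id": 1, "sql": "COMMIT", "exec_ts": 3},
    {"txn_id": 2, "sql": "SELECT a FROM t", "exec_ts": 1},
]
print(len(process(events, "primary")))  # 2 transaction groups
print(len(process(events, "replica")))  # 3 independent statements
```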
Traffic Replay
The replay‑agent runs as a Kubernetes job, consuming traffic files from S3, controlling replay speed, and executing SQL in the sandbox.
Data Analyze & Report
Collect runtime metrics (CPU, load, slow queries, replication lag).
Gather SQL execution times.
Capture MySQL configuration and instance details.
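Condensing these raw signals into a report might look like the sketch below. The field names and the 1-second slow-query threshold are assumptions for illustration, not the system's actual report format.

```python
def summarize(exec_times_ms, cpu_samples):
    """Condense raw replay metrics into a small report dict.
    Field names and thresholds are illustrative."""
    xs = sorted(exec_times_ms)
    p99 = xs[min(len(xs) - 1, int(len(xs) * 0.99))]
    return {
        "sql_p99_ms": p99,
        "cpu_max_pct": max(cpu_samples),
        # Assumed slow-query threshold: 1000 ms.
        "slow_queries": sum(1 for t in exec_times_ms if t > 1000),
    }

report = summarize([5, 8, 12, 1500], [35, 62, 78])
print(report["slow_queries"])  # 1
print(report["cpu_max_pct"])   # 78
```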
Capacity Exploration
To determine the maximum QPS a cluster can sustain, the system replays traffic at accelerated speeds in a sandbox, gradually increasing load until alerts trigger.
Fast exploration phase: Exponential speed increase quickly identifies the approximate capacity boundary.
Precise calibration phase: Binary search between the last safe speed and the first alert‑triggering speed refines the exact limit.
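The two phases combine into a bracket-then-bisect search. The sketch below assumes a hypothetical probe `alert_triggers(speed)` that replays traffic at the given multiple and reports whether any alert fired; the starting speed, ceiling, and tolerance are illustrative.

```python
def explore(alert_triggers, start=1.0, max_speed=1024.0, tol=0.5):
    """Find the maximum safe replay speed.
    Phase 1: double the speed until an alert fires (exponential bracket).
    Phase 2: binary-search between last safe and first alerting speed."""
    lo, hi = 0.0, start
    while hi <= max_speed and not alert_triggers(hi):
        lo, hi = hi, hi * 2          # fast exploration: exponential growth
    while hi - lo > tol:
        mid = (lo + hi) / 2          # precise calibration: bisection
        if alert_triggers(mid):
            hi = mid
        else:
            lo = mid
    return lo                        # highest speed observed safe

# Example: pretend alerts fire above a 12.3x replay speed.
limit = explore(lambda s: s > 12.3)
print(round(limit, 1))  # 12.0
```

In production each probe is a full accelerated replay, so the exponential phase matters: it reaches the boundary in O(log n) probes rather than stepping linearly.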
Accelerated replay compresses SQL execution intervals, raising request density. The replay agent schedules each SQL statement with a coroutine, computing a planned execution time (planDelta) from the desired speed and comparing it with the real elapsed time (realPassDelta) to decide whether to wait or execute immediately.
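The pacing decision can be sketched as follows. This is a single-threaded simplification of the coroutine scheduler; `execute` stands in for the real SQL runner, and the timestamp field name is assumed.

```python
import time

def replay(statements, speedup: float, execute):
    """Pace replay at `speedup`x the recorded rate. For each statement,
    plan_delta is the recorded offset divided by the speedup; if the real
    elapsed time (real_pass_delta) is still behind the plan, wait, else
    execute immediately. `execute` is a hypothetical SQL runner."""
    t0_recorded = statements[0]["exec_ts"]
    t0_real = time.monotonic()
    for st in statements:
        plan_delta = (st["exec_ts"] - t0_recorded) / speedup
        real_pass_delta = time.monotonic() - t0_real
        if real_pass_delta < plan_delta:
            time.sleep(plan_delta - real_pass_delta)
        execute(st["sql"])

ran = []
stmts = [{"exec_ts": 0.0, "sql": "SELECT 1"},
         {"exec_ts": 0.2, "sql": "SELECT 2"}]
replay(stmts, speedup=10.0, execute=ran.append)  # 0.2 s gap becomes 0.02 s
print(len(ran))  # 2
```

The real agent runs one such loop per coroutine so that slow statements on one connection never stall the pacing of the others.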
Capacity Operation
The operation module provides real‑time visibility of cluster capacity and automated recommendations.
Evaluation hosting: Core clusters with stable traffic models are registered for periodic capacity exploration.
Capacity calculation: Computes usage water‑level (online QPS / explored max QPS) and suggests scaling actions.
Automated ops: One‑click workflows trigger capacity‑adjustment tasks, with traceable, observable, and rollback‑able change pipelines.
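The water-level formula above is simple enough to show directly. The scale-up/scale-down thresholds below are illustrative assumptions, not Meituan's actual policy.

```python
def water_level(online_qps: float, max_qps: float) -> float:
    """Usage water-level = online QPS / explored max QPS."""
    return online_qps / max_qps

def recommend(level: float, high: float = 0.8, low: float = 0.3) -> str:
    """Map a water-level to a scaling suggestion. Thresholds are assumed."""
    if level > high:
        return "scale-up"
    if level < low:
        return "scale-down"
    return "hold"

lvl = water_level(online_qps=42_000, max_qps=50_000)
print(f"{lvl:.0%}", recommend(lvl))  # 84% scale-up
```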
Future Plans
Support more database types (MGR, Proxy, Blade, Elasticsearch).
Integrate with budgeting systems to expose hidden resource redundancy.
Introduce intelligent capacity tuning that automatically analyzes bottlenecks and validates optimizations.
Build a case‑library of typical promotion and incident scenarios with data snapshots for chaos engineering.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.