
How Meituan Built a Real‑Time Database Capacity Assessment System

Meituan's database team created a sandbox‑based capacity assessment platform that replays live traffic, uses accelerated replay to discover performance bottlenecks, and closes the loop with capacity monitoring and automated operations, dramatically improving stability and resource utilization.

ITPUB

Project Background

Database stability is critical for large‑scale services, but evaluating read/write limits and preventing risky changes are difficult. Traditional metric‑based calculations and full‑link load testing often give inaccurate capacity predictions and incur high test‑environment costs.

Project Goals

Data‑operation safety: replay and evaluation must not affect production clusters.

Realistic assessment: replay traffic in an environment that faithfully simulates production.

Flexibility and efficiency: the system must integrate easily with a variety of services.

Capability Overview

The system consists of a traffic‑replay platform, external OpenAPI empowerment, core functions (traffic replay, capacity probing, capacity operation), and modular foundational capabilities.

Traffic Replay

Live SQL traffic is recorded, cleaned, and stored in ClickHouse. The replay pipeline has three stages:

Traffic Collect: MTSQL kernel captures full SQL, transaction ID, and latency; rds‑agent sends the data to Kafka.

Traffic Process: Flink cleans the stream, filters for registered clusters, and writes structured data to ClickHouse.

Traffic Replay: A replay‑agent consumes traffic files from S3, controls the replay rhythm, and replays SQL into an isolated sandbox cluster.

During replay, metrics such as CPU busy, load average, slow queries, replication lag, and per‑SQL execution time are collected.

Capacity Exploration (Capacity Up‑Probe)

Real traffic from peak periods is replayed at accelerated speeds in a sandbox to find the maximum sustainable QPS. The process iterates through a fast‑probe phase (exponential speed increase) and a precise‑calibration phase (binary search between the last triggered and non‑triggered speeds) until a predefined accuracy is reached.
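The two-phase probe can be sketched as follows. This is a minimal illustration, not Meituan's actual implementation: the function names `replay_at` and `triggered`, the doubling factor, and the 5% accuracy target are all assumptions.

```python
def probe_max_speed(replay_at, triggered, eps=0.05):
    """Find the highest sustainable replay speed in the sandbox.

    replay_at(speed) runs one replay round and returns its metrics;
    triggered(metrics) reports whether any bottleneck threshold
    (CPU busy, slow queries, replication lag, ...) fired.
    """
    # Fast-probe phase: increase speed exponentially until a
    # bottleneck triggers.
    speed = 1.0
    while not triggered(replay_at(speed)):
        speed *= 2
    lo, hi = speed / 2, speed  # last safe speed, first triggered speed

    # Precise-calibration phase: binary search between the last
    # non-triggered and triggered speeds until the interval is
    # within the accuracy target.
    while hi - lo > eps * lo:
        mid = (lo + hi) / 2
        if triggered(replay_at(mid)):
            hi = mid
        else:
            lo = mid
    return lo
```

The returned value is the largest speed verified not to trigger a bottleneck, which maps directly to a maximum sustainable QPS for the cluster.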

Key formulas:

Original offset: originalDelta = sqlinfo.BeginAt - sqlBaseTime

Adjusted offset: planDelta = originalDelta / replaySpeed

During replay, the system compares planDelta with the real elapsed time realPassDelta to decide whether to wait or to execute immediately, which keeps the replay speed accurate.
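A minimal sketch of this speed control using the formulas above; the event structure and function names are hypothetical:

```python
import time

def replay_paced(events, replay_speed, execute):
    """Replay SQL events at replay_speed times their original pace.

    Each event carries begin_at (its original timestamp, in seconds);
    execute(event) runs the statement against the sandbox.
    """
    sql_base_time = events[0].begin_at       # timestamp of the first statement
    start = time.monotonic()
    for ev in events:
        original_delta = ev.begin_at - sql_base_time   # offset in live traffic
        plan_delta = original_delta / replay_speed     # compressed offset
        real_pass_delta = time.monotonic() - start     # actual elapsed time
        if plan_delta > real_pass_delta:
            # Ahead of schedule: wait until the planned offset.
            time.sleep(plan_delta - real_pass_delta)
        # On schedule or behind: execute immediately.
        execute(ev)
```

Dividing the original offset by the replay speed compresses the inter-statement gaps uniformly, so a 2x replay reproduces the traffic's shape in half the wall-clock time.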

Capacity Operation

The operation service provides real‑time visibility of cluster capacity, automatically manages assessment tasks, calculates usage watermarks, and generates scaling recommendations.

Assessment Management: Registers stable‑traffic clusters, schedules periodic up‑probe jobs, and refreshes results when configuration changes.

Capacity Calculation: Computes usage watermarks (online QPS / evaluated max QPS) and suggests shrink or expand actions.

Automated Operations: One‑click workflow triggers scaling actions and includes risk‑control mechanisms (traceability, observability, rollback).
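The watermark calculation and its scaling suggestion might look like this sketch; the 80% expand and 30% shrink thresholds are illustrative assumptions, not values from the article:

```python
def capacity_advice(online_qps, max_qps, expand_at=0.8, shrink_at=0.3):
    """Compute the usage watermark (online QPS / evaluated max QPS)
    and suggest a scaling action based on threshold bands."""
    watermark = online_qps / max_qps
    if watermark >= expand_at:
        return watermark, "expand"   # running close to the assessed limit
    if watermark <= shrink_at:
        return watermark, "shrink"   # heavily over-provisioned
    return watermark, "hold"         # within the comfortable band
```

Because max_qps comes from replaying the cluster's own peak traffic rather than a generic benchmark, the watermark reflects real headroom for that workload.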

Future Plans

Support additional database engines (MGR, Proxy, Blade, Elasticsearch).

Integrate with budgeting systems to expose resource redundancy.

Introduce intelligent capacity tuning with automated re‑testing.

Build a case library of typical scenarios (promotions, incidents) for chaos engineering and long‑term stability studies.

Tags: Automation, traffic replay, performance testing, MySQL, capacity planning, database capacity
Written by ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.