
From Automation to Intelligent Database Operations: Meituan DBA Team’s Journey and Practices

Meituan’s DBA team progressed from simple scripting through tooling, productization, private‑cloud self‑service and full automation to an emerging intelligent‑operation model that leverages extensive data collection, risk‑driven pre‑warning, AI‑assisted analysis, and self‑healing mechanisms to meet rapid growth, stability and scalability demands.

Meituan Technology Team

Background In recent years, traditional database operation methods have struggled to meet business demands for stability, availability, and flexibility. Rapid growth in database scale and the emergence of NewSQL systems have outpaced manual operations, prompting Meituan’s DBA team to move from manual work through tool‑based, productized, and self‑service operations to automated operations, while also exploring intelligent operations.

This article outlines the evolution of Meituan’s database platform, its current state and challenges, and the team’s thinking, exploration, and practice in moving from automation to intelligent operations.

Database Platform Evolution The platform has gone through five major stages:

1. Scripting stage – small team, few clusters, low traffic; scripts were sufficient.

2. Tooling stage – scripts were packaged into tools, alongside CMDB asset management, monitoring, and utilities for DDL changes, SQL review, slow‑query analysis, backup, and flashback.

3. Productization stage – tools were assembled into repeatable processes, forming products that standardize DBA actions, improve usability and security, and reduce incidents.

4. Private‑cloud platform stage – to meet rapid business growth, many operations were opened to developers for self‑service (e.g., schema changes, query execution, account provisioning, monitoring, data masking, peak‑off‑peak definitions, log access).

5. Automation stage – moving from semi‑automatic Web‑based operations to fully automated processes (e.g., MySQL high‑availability, self‑protection, capacity diagnosis, auto‑scaling).

Current Situation and Challenges The platform now supports a wide range of functions for RDS, including HA, MGW management, DNS changes, backup, upgrade, traffic switching, account management, data archiving, and asset flow. It is organized along three dimensions: by user (self‑service RDS, DBA management, test environment), by function (operations, monitoring), and by storage type (MySQL, distributed KV cache, distributed KV store, and NewSQL, which is still under construction). The goal is a one‑stop service platform for MySQL, NoSQL, and NewSQL.

Challenge 1: Root causes are hard to locate. Locating a fault inside the database requires deep expertise. Delayed alarm handling can miss the best response window, which motivates strategies such as fast master switchover, automatic isolation of faulty replicas, and prevention of recurring issues.

Challenge 2: Staffing and growth constraints. Traffic grows faster than headcount, which can grow linearly at best. DBA work becomes fragmented and repetitive, and recruitment is difficult. The team must consider how to break through these constraints and move toward intelligent operations.

From Automation to Intelligent Operations Traditional operations are reactive (fault‑triggered) and manual, whereas intelligent operations are proactive (risk‑driven) and rely on extensive data collection, analysis, and automated decision and execution. The aim is to raise the share of problems caught by pre‑warning rather than by alarms, reducing the need for human intervention.

Data collection covers four kinds of sources: database‑side data (Global Status, Variables, Processlist, InnoDB Status, logs, binlog), application‑side metrics (success rate, latency percentiles, error logs, throughput), system‑level metrics (per‑second sampling, OS statistics), and change‑side events (topology adjustments, online DDL/DML, platform operation logs, deployment records).
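As a rough illustration of what database-side collection involves: most Global Status counters in MySQL are cumulative, so a collector typically stores the difference between two samples rather than the raw value. The sketch below assumes the snapshots have already been parsed into dictionaries; the function name `counter_rates` and the sample numbers are hypothetical, not taken from Meituan's platform.

```python
from typing import Dict


def counter_rates(prev: Dict[str, int], curr: Dict[str, int],
                  interval_s: float) -> Dict[str, float]:
    """Turn two cumulative counter snapshots into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates: Dict[str, float] = {}
    for name, value in curr.items():
        if name not in prev:
            continue  # counter has no baseline yet
        delta = value - prev[name]
        if delta < 0:
            continue  # counter went backwards (e.g. server restart); skip it
        rates[name] = delta / interval_s
    return rates


# Hypothetical snapshots taken 10 seconds apart, as might be parsed
# from SHOW GLOBAL STATUS output.
prev = {"Questions": 1_000_000, "Com_select": 700_000, "Slow_queries": 40}
curr = {"Questions": 1_006_000, "Com_select": 704_200, "Slow_queries": 43}
rates = counter_rates(prev, curr, interval_s=10.0)
print(rates)
```

Skipping negative deltas is one simple way to handle counter resets; a production collector might instead record the reset as an event of its own.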

Data analysis proceeds from cluster‑level to instance‑level to table‑level, enabling comparative metrics (year‑over‑year, month‑over‑month) and supporting resource planning, capacity scaling, storage‑KV decisions, budgeting, and targeted fault remediation.
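The roll-up and comparison described above can be sketched as follows. This is a minimal example under stated assumptions: table-level samples arrive as `(cluster, instance, table, size_gb)` tuples, and the cluster, instance, and table names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (cluster, instance, table, size_gb): hypothetical table-level samples
Row = Tuple[str, str, str, float]


def roll_up(rows: Iterable[Row]) -> Tuple[Dict[str, float],
                                          Dict[Tuple[str, str], float]]:
    """Aggregate table-level sizes to instance level and cluster level."""
    by_cluster: Dict[str, float] = defaultdict(float)
    by_instance: Dict[Tuple[str, str], float] = defaultdict(float)
    for cluster, instance, _table, size_gb in rows:
        by_cluster[cluster] += size_gb
        by_instance[(cluster, instance)] += size_gb
    return dict(by_cluster), dict(by_instance)


def pct_change(current: float, previous: float) -> float:
    """Percentage change between two periods (month-over-month, year-over-year)."""
    if previous == 0:
        raise ValueError("previous value is zero; change is undefined")
    return (current - previous) / previous * 100.0


last_month = [("orders", "db-01", "t_order", 400.0),
              ("orders", "db-02", "t_order_item", 100.0)]
this_month = [("orders", "db-01", "t_order", 460.0),
              ("orders", "db-02", "t_order_item", 115.0)]

prev_cluster, _ = roll_up(last_month)
curr_cluster, curr_instance = roll_up(this_month)
mom = pct_change(curr_cluster["orders"], prev_cluster["orders"])
print(f"orders cluster grew {mom:.1f}% month-over-month")
```

The same `pct_change` helper serves year-over-year comparison by passing the value from twelve months earlier as `previous`.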

Pre‑warning transforms alerts into actionable insights. Examples include latency‑related alerts prompting CPU upgrades, disk‑space alerts guiding storage expansion, and slow‑query “red‑black” lists identifying unauthorized access patterns for service governance.
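One way to turn a disk-space alert into a pre-warning is to estimate how long the remaining space will last. The sketch below fits a least-squares line to daily usage samples and flags the instance when the projected time to full falls within a planning horizon. The linear-growth assumption, the 30-day horizon, and the function names are illustrative choices, not details from the original article.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days until the disk fills, from a least-squares linear trend.

    Returns None when usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_used_gb))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None
    remaining = max(capacity_gb - daily_used_gb[-1], 0.0)
    return remaining / slope


def should_prewarn(daily_used_gb: Sequence[float], capacity_gb: float,
                   horizon_days: float = 30.0) -> bool:
    """True when the projected time to full is within the planning horizon."""
    eta = days_until_full(daily_used_gb, capacity_gb)
    return eta is not None and eta <= horizon_days


usage = [700.0, 710.0, 720.0, 730.0, 740.0]  # hypothetical daily samples
eta = days_until_full(usage, capacity_gb=1000.0)
print(eta, should_prewarn(usage, capacity_gb=1000.0))
```

Unlike a fixed threshold alarm, this check fires while there is still time to schedule an expansion, which is the point of risk-driven pre-warning.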

Operational automation includes fast failover, automatic configuration generation, monitoring enablement, replica recovery, rule‑based alarm handling, and gradual reduction of manual involvement.
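Rule-based alarm handling can be pictured as an ordered list of rules that map an alarm to an automated action, with a human escalation as the fallback. The sketch below is only an illustration: the alarm kinds, the 300-second lag threshold, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Alarm:
    kind: str     # hypothetical alarm type, e.g. "replica_lag"
    role: str     # "master" or "replica"
    value: float  # metric value attached to the alarm


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Alarm], bool]
    action: str


# Rules are checked in order; the first match wins.
RULES: List[Rule] = [
    Rule("failover",
         lambda a: a.kind == "master_unreachable" and a.role == "master",
         "switch_master"),
    Rule("isolate",
         lambda a: a.kind == "replica_lag" and a.role == "replica"
         and a.value > 300,
         "isolate_replica"),
]


def decide(alarm: Alarm, rules: List[Rule] = RULES) -> str:
    """Return the automated action for an alarm, or escalate to a DBA."""
    for rule in rules:
        if rule.matches(alarm):
            return rule.action
    return "page_dba"


print(decide(Alarm("replica_lag", "replica", 600.0)))
print(decide(Alarm("replica_lag", "replica", 10.0)))
```

Moving a class of alarm from the `page_dba` fallback into an explicit rule is one way to read the "gradual reduction of manual involvement" mentioned above.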

Outlook Future work includes building a fault‑diagnosis platform (similar to “Bian Que”) for log collection, storage, and analysis, providing APIs for end‑to‑end fault localization and service governance. Intelligent operations will further integrate AI, Big Data, and Cloud Computing, blurring the lines between SQL and NoSQL, and moving toward self‑discovering, self‑diagnosing, and self‑healing database services.

Author Bio Zhao Yinggang, Meituan researcher and database expert, with 10 years of experience in database automation, performance optimization, large‑scale cluster assurance, and architecture optimization across companies such as Baidu, Sina, and Qunar.
