
How Didi Scales MySQL: From Manual Ops to Full Automation

This article outlines Didi's MySQL database architecture, the challenges of managing thousands of instances, and the step‑by‑step automation framework—including dbproxy, high‑availability, backup, monitoring, and deployment modules—that reduces manual DBA work by over 70%.

ITPUB

1. Didi DB Architecture Overview

Didi relies primarily on MySQL for its ride‑hailing services, operating 3,000 to 4,000 DB servers that host 7,000 to 8,000 instances. The architecture consists of a TGW/LVS VIP layer, a distributed dbproxy middleware layer, and a MySQL master‑slave topology (one primary, one backup master, and one or more slaves). High‑QPS workloads may require additional slaves. Zabbix monitors MySQL health and failures, and backup and performance‑optimization modules are also in place.

The dbproxy is the entry point for all DB traffic. It logs both normal and failed requests (for example, whitelist violations and SQL syntax errors) so that they can be intercepted and analyzed.
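As a rough illustration of this gatekeeping role, the sketch below checks each request against a client whitelist and records accepted and rejected requests separately. The names (`ProxyGate`, `ALLOWED_CLIENTS`, `KNOWN_VERBS`, `handle`) and the crude statement check are hypothetical and are not taken from Didi's dbproxy, whose internals the source does not describe.

```python
from dataclasses import dataclass, field

# Hypothetical whitelist: client IPs allowed to reach the database.
ALLOWED_CLIENTS = {"10.0.0.5", "10.0.0.6"}

# Crude stand-in for a real SQL parser: accept only a few statement types.
KNOWN_VERBS = ("SELECT", "INSERT", "UPDATE", "DELETE")


@dataclass
class ProxyGate:
    """Toy model of a proxy that logs normal and rejected requests."""
    access_log: list = field(default_factory=list)
    error_log: list = field(default_factory=list)

    def handle(self, client_ip: str, sql: str) -> bool:
        """Return True if the request would be forwarded to MySQL."""
        if client_ip not in ALLOWED_CLIENTS:
            self.error_log.append((client_ip, sql, "whitelist violation"))
            return False
        if not sql.strip().upper().startswith(KNOWN_VERBS):
            self.error_log.append((client_ip, sql, "syntax error"))
            return False
        self.access_log.append((client_ip, sql))
        return True


gate = ProxyGate()
gate.handle("10.0.0.5", "SELECT 1")      # forwarded and logged
gate.handle("192.168.1.9", "SELECT 1")   # rejected: client not whitelisted
gate.handle("10.0.0.6", "SELEC 1")       # rejected: unrecognized statement
```

Keeping rejected requests in a separate log is one simple way to make whitelist violations and malformed SQL easy to review later.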

2. Main Operational Tasks

Daily DBA work includes deployment, ticket handling, capacity expansion, and responding to monitoring alerts. In a typical week, the team receives requests for 30 to 50 new instances, alongside tickets for DDL changes, whitelist updates, and other requests. Deployment and ticket processing account for roughly 70% of the effort and are prime candidates for automation.

3. Key Modules Requiring Automation

Core modules that must remain functional while being automated are high‑availability, data backup, monitoring/alerting, and the online DDL system. For online DDL, Didi previously used PT; the current stack has shifted to ghost.

4. Automation Modules and Workflow

The automation pipeline is divided into four layers:

Web layer: Front‑end UI for users to submit DB instance requests.

API layer: Handles actions such as creating instances, clusters, and dbproxy components, as well as initiating backups.

Scheduling layer: Built with Python and Tornado, orchestrating tasks.

Execution layer: Powered by SaltStack to run commands on target machines.

Standardization is applied to OS initialization (filesystem, kernel settings, mount points) and DB configuration (config files, deployment paths, naming conventions for instances and IDs).

The data‑link layer uses the open‑source canal, Kafka, and ZooKeeper to re‑hash data, enabling cross‑city table aggregation for historical queries.
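The re‑hashing idea can be sketched in a few lines. The source does not say which key Didi re‑hashes on, so this example assumes a hypothetical `passenger_id` field and a fixed partition count; plain Python lists stand in for the canal change stream and the Kafka topics.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed number of target partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash so the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rehash(events):
    """Group change events by the partition of their passenger_id."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(event["passenger_id"])].append(event)
    return partitions


# Hypothetical change events, as they might arrive from per-city source tables.
events = [
    {"city": "beijing", "order_id": 1, "passenger_id": "p-100"},
    {"city": "shanghai", "order_id": 2, "passenger_id": "p-100"},
    {"city": "beijing", "order_id": 3, "passenger_id": "p-200"},
]

partitions = rehash(events)

# Both of p-100's orders land in one partition, regardless of city.
target = partition_for("p-100")
cities = {e["city"] for e in partitions[target] if e["passenger_id"] == "p-100"}
```

Because the partition depends only on the chosen key, rows that originate in different city tables end up together, which is what makes a cross‑city historical query possible against a single aggregated location.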

5. Detailed Automation Steps

When a user requests a new MySQL master‑slave instance, the system:

Collects service name, version, and port.

Determines the required number of nodes based on QPS/TPS.

Copies a template data file that contains the MHA user credentials.

Generates a configuration file from the template, replacing placeholders (port, datadir, binlogdir, etc.).

Executes SaltStack modules to create directories, pull the appropriate MySQL version, and start the service.

Configures replication relationships after the instances are up.

If any step fails, SaltStack aborts the workflow, ensuring consistency.
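A minimal sketch of two parts of this flow, template rendering and fail‑fast step execution, is shown below. It assumes a simplified `my.cnf` template and uses plain Python callables in place of the SaltStack modules; the helper names (`render_config`, `run_steps`), the directory layout, and the simulated failure are hypothetical.

```python
from string import Template

# Hypothetical my.cnf template; real templates carry many more settings.
MY_CNF_TEMPLATE = Template(
    "[mysqld]\n"
    "port = $port\n"
    "datadir = $datadir\n"
    "log_bin = $binlogdir/mysql-bin\n"
)


def render_config(port: int, base_dir: str) -> str:
    """Fill the template placeholders for one instance."""
    return MY_CNF_TEMPLATE.substitute(
        port=port,
        datadir=f"{base_dir}/data",
        binlogdir=f"{base_dir}/binlog",
    )


def run_steps(steps):
    """Run steps in order and stop at the first failure.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            return completed, name
        completed.append(name)
    return completed, None


def failing_start():
    raise RuntimeError("mysqld did not start")  # simulated failure


config_text = render_config(3306, "/data/mysql3306")
steps = [
    ("create_dirs", lambda: None),            # placeholder for a remote command
    ("start_mysql", failing_start),
    ("configure_replication", lambda: None),  # never reached in this run
]
completed, failed = run_steps(steps)
```

Here `completed` holds only the steps that ran before the failure, so replication is never configured against an instance that did not start.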

6. Online DDL Process

Online DDL follows four steps: create an empty table, modify its schema, sync historical and incremental data into it, then rename it to replace the old table. Didi moved from PT‑based triggers (which doubled QPS load) to an inception+ghost solution that replays binlogs from a replica, reducing pressure on the primary node.
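The shadow‑table sequence can be tried end to end with a small stand‑in. The sketch below uses Python's built‑in `sqlite3` module rather than MySQL, and it copies rows once instead of replaying binlogs, so it illustrates only the order of operations, not how PT or ghost work internally. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO orders (id, city) VALUES (?, ?)",
    [(1, "beijing"), (2, "shanghai")],
)

# 1. Create an empty shadow table with the same structure as the original.
conn.execute("CREATE TABLE orders_shadow (id INTEGER PRIMARY KEY, city TEXT)")

# 2. Apply the schema change to the shadow table only.
conn.execute("ALTER TABLE orders_shadow ADD COLUMN status TEXT DEFAULT 'new'")

# 3. Copy existing rows. A real tool also applies incremental changes
#    (via triggers or binlog replay) that arrive during the copy.
conn.execute("INSERT INTO orders_shadow (id, city) SELECT id, city FROM orders")

# 4. Swap the tables by renaming.
conn.execute("ALTER TABLE orders RENAME TO orders_old")
conn.execute("ALTER TABLE orders_shadow RENAME TO orders")
conn.commit()

rows = conn.execute("SELECT id, city, status FROM orders ORDER BY id").fetchall()
```

After the swap, readers of `orders` see the new `status` column, while the original data remains available in `orders_old` until it is dropped.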

7. Remaining Automation Gaps

While MySQL, dbproxy, and MHA provisioning are automated, several areas still need work:

Resource pool selection for new instance requests.

Automatic VIP allocation tied to dbproxy machines.

Fine‑grained monitoring alerts that notify instance owners.

Automated slow‑query analysis and optimization recommendations.

These gaps are slated for future development.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.