
How NetEase Scaled DBA Automation: From Manual Ops to Self‑Service Platforms

From 2015 to now, NetEase’s DBA team transformed from manual maintenance to a fully automated, platform‑driven system that supports multi‑active database architectures, real‑time monitoring, automated alarm handling, MHA management, resource pooling, rapid migration, and self‑service SQL review, dramatically reducing downtime and operational overhead.

Efficient Ops
Self-Introduction

I am Cai Peng. I joined Ele.me in 2015 and witnessed its business and technology grow from zero to one. I took part in the rapid development of the database and DBA team, transitioning from an operations DBA to a DEV-DBA focused on empowering both the DBA and DEV teams.

Over the past years we advanced roughly one stage per year, from manual operations to tooling, then platformization, and finally self-service, completing all iterations within two and a half years; platformization and self-service, together with the multi-active database transformation, were accomplished in just eight months.

Simultaneously, our database architecture evolved from traditional master‑slave to geographically distributed multi‑active setups, presenting huge challenges for DBAs that the platform must address.

Traditional DBA methods cannot handle the complexity of multi‑active, large‑scale management, making platformization essential.

As platformization advanced, DBA roles shifted from heavy maintenance to business‑focused value creation.

Overall Functional Overview

DB‑Agent: data collection, process management, remote script and Linux command execution, platform integration.

MM‑OST: a non‑intrusive DDL system based on gh‑ost for multi‑active database releases.

Tinker: Go rewrite of Linux crontab with second‑level granularity and integrated management interfaces.

Checksum: cross‑data‑center consistency checking.

SqlReview: Go implementation of an Inception‑like SQL review tool with enhanced features.

Luna: optimized alarm system for large‑scale instances, reducing noise while preserving critical alerts.

VDBA: automated alarm handling system that replaces manual DBA intervention for online DB incidents.

Real‑Time Monitoring & Rapid Troubleshooting

Typical DBA incident handling involves logging into servers and manually executing commands, taking at least two minutes. Our platform automates this process, delivering symptom and cause instantly, crucial when a minute of outage costs tens of thousands of orders.

The monitoring dashboard shows the status of all instances, highlighting anomalies and providing one‑click execution of common diagnostics such as execution plan analysis, lock analysis, SQL latency distribution, historical trends, and processlist snapshots.
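The processlist diagnostic above largely boils down to grouping running statements and surfacing pile-ups. A sketch under simplifying assumptions (the Process struct and the minimum-runtime threshold are illustrative, not the platform's real schema):

```go
package main

import (
	"fmt"
	"sort"
)

// Process mirrors one row of SHOW PROCESSLIST, simplified.
type Process struct {
	ID    int
	State string
	Info  string // normalized SQL text
	Time  int    // seconds the statement has been running
}

// topOffenders groups long-running statements by normalized SQL and returns
// them ordered by count, so a dashboard can highlight pile-ups at a glance.
func topOffenders(ps []Process, minTime int) []string {
	counts := map[string]int{}
	for _, p := range ps {
		if p.Info != "" && p.Time >= minTime {
			counts[p.Info]++
		}
	}
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j] // deterministic tiebreak
	})
	return keys
}

func main() {
	snapshot := []Process{
		{1, "Sending data", "SELECT * FROM orders WHERE status = ?", 12},
		{2, "Sending data", "SELECT * FROM orders WHERE status = ?", 9},
		{3, "updating", "UPDATE users SET last_seen = ?", 3},
	}
	fmt.Println(topOffenders(snapshot, 5)) // the repeated SELECT dominates
}
```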

Alarm Handling Automation

Automatic space issue resolution.

Uncommitted transaction handling.

Automatic kill of long‑running queries.

CPU/connection/thread overload analysis and mitigation.

Lossless replication repair (error codes 1032, 1062).

Instead of skipping problematic replication entries, we parse binlogs for precise, lossless repairs, avoiding the hidden risks of traditional skip‑based fixes.
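The decision logic behind such lossless repairs might look like the following sketch. The RowEvent shape and the repair actions are illustrative assumptions for error codes 1062 (duplicate key) and 1032 (row not found), not the production code:

```go
package main

import "fmt"

// RowEvent is a simplified binlog row event (assumed shape for this sketch).
type RowEvent struct {
	Type string // "INSERT", "UPDATE", "DELETE"
}

// repairPlan chooses a lossless fix from the replication error code and the
// binlog event that failed, instead of blindly skipping the event.
func repairPlan(errCode int, ev RowEvent) string {
	switch {
	case errCode == 1062 && ev.Type == "INSERT":
		// Duplicate key: the row already exists on the replica; verify it
		// matches the master image, then apply an idempotent overwrite.
		return "verify row image, then REPLACE INTO with master values"
	case errCode == 1032 && ev.Type == "DELETE":
		// Row not found: the delete target is already gone; confirm absence
		// against the before-image, then mark the event applied.
		return "confirm row absent, mark event applied"
	case errCode == 1032 && ev.Type == "UPDATE":
		// Row not found: rebuild the row from the after-image and insert it.
		return "INSERT row from after-image"
	}
	return "escalate to DBA"
}

func main() {
	fmt.Println(repairPlan(1062, RowEvent{Type: "INSERT"}))
	fmt.Println(repairPlan(1032, RowEvent{Type: "DELETE"}))
}
```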

MHA Automation Management

Pre‑MySQL 8.0 high‑availability relied on MHA, which required complex SSH‑based deployment and introduced reliability concerns. We replaced it with an agent‑driven approach.

Each db‑agent implements interfaces such as GetDBTopology(), BuildMHAConfig(), WriteRsaPublicKey(), StartMHA(), MHAProcessMonitor(), InspectMHAConfigIsOK(), StopMHA(), and SwitchMHA(). The platform sequentially invokes these APIs, completing full MHA setup and management in seconds (previously 2‑10 minutes).
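The interface names below come from the text, but the signatures and the fake agent are simplified assumptions, meant only to show how the platform can drive the per-host APIs in sequence and stop at the first failure:

```go
package main

import "fmt"

// MHAAgent captures a subset of the per-host operations named above;
// real signatures would carry topology detail and richer errors.
type MHAAgent interface {
	GetDBTopology() (string, error)
	BuildMHAConfig() error
	WriteRsaPublicKey() error
	StartMHA() error
	InspectMHAConfigIsOK() error
}

// setupMHA is the platform-side orchestration: invoke the agent APIs in
// order, aborting on the first error.
func setupMHA(a MHAAgent) error {
	if _, err := a.GetDBTopology(); err != nil {
		return err
	}
	steps := []func() error{
		a.BuildMHAConfig,
		a.WriteRsaPublicKey,
		a.StartMHA,
		a.InspectMHAConfigIsOK,
	}
	for _, step := range steps {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

// fakeAgent is a stub standing in for a real db-agent reached over RPC/HTTP.
type fakeAgent struct{}

func (fakeAgent) GetDBTopology() (string, error) { return "master->slave1,slave2", nil }
func (fakeAgent) BuildMHAConfig() error          { return nil }
func (fakeAgent) WriteRsaPublicKey() error       { return nil }
func (fakeAgent) StartMHA() error                { return nil }
func (fakeAgent) InspectMHAConfigIsOK() error    { return nil }

func main() {
	fmt.Println(setupMHA(fakeAgent{})) // <nil> on a clean setup
}
```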

Resource Pool & One‑Click Installation

Previously, scaling required manual scripts on dozens of machines, leading to inconsistent implementations. Now DBAs simply check resource pool availability; agents handle installation and configuration automatically.

Scaling & Migration

Across 2015-2016 we migrated over 3,000 clusters between CDB, RDS, and our own disaster-recovery system, with each cluster undergoing two to three migrations. Automated scripts cut a 300-cluster migration from two weeks to two days, and today a script that one person prepares in an hour drives a full migration through the scheduling cluster.

Mis‑Operation Rollback

We have performed four rapid rollbacks for online mishaps. Open‑source tools were deemed unsuitable due to command‑line complexity and lack of UI integration, so we built a Go service using github.com/siddontang/go‑mysql/replication to parse binlogs across sharded tables efficiently.
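The core of a binlog-based rollback is inverting each row event using its before/after images: an INSERT becomes a DELETE, a DELETE becomes a re-insert, and an UPDATE swaps its images. The real service parses binlogs with go-mysql; this self-contained sketch uses illustrative column maps to show only the inversion step:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rollbackSQL builds the compensating statement for one row event from the
// row images recovered out of the binlog. Column names and values here are
// illustrative; quoting/escaping is omitted for brevity.
func rollbackSQL(table, op string, before, after map[string]string) string {
	pairs := func(img map[string]string, sep string) string {
		parts := make([]string, 0, len(img))
		for k, v := range img {
			parts = append(parts, fmt.Sprintf("%s='%s'", k, v))
		}
		sort.Strings(parts) // deterministic output
		return strings.Join(parts, sep)
	}
	switch op {
	case "INSERT": // undo by deleting the inserted row (after-image)
		return fmt.Sprintf("DELETE FROM %s WHERE %s", table, pairs(after, " AND "))
	case "DELETE": // undo by re-inserting the before-image
		return fmt.Sprintf("INSERT INTO %s SET %s", table, pairs(before, ", "))
	case "UPDATE": // undo by restoring the before-image where the after-image matches
		return fmt.Sprintf("UPDATE %s SET %s WHERE %s",
			table, pairs(before, ", "), pairs(after, " AND "))
	}
	return ""
}

func main() {
	fmt.Println(rollbackSQL("users", "INSERT", nil,
		map[string]string{"id": "7", "name": "a"}))
	// DELETE FROM users WHERE id='7' AND name='a'
}
```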

Task Scheduling

We rewrote Linux crontab in Go with second‑level precision, added management modules, and exposed the scheduler as a service for platform integration, including logging, exit‑code capture, and error‑code mapping for clearer troubleshooting.
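The second-level scheduling idea reduces to a once-per-second tick that fires every job whose interval divides the current tick, a granularity standard crontab (minute resolution) cannot express. The Job fields and the simulation loop below are assumptions for illustration, not Tinker's actual API:

```go
package main

import "fmt"

// Job fires every Interval seconds.
type Job struct {
	Name     string
	Interval int // seconds
	Run      func()
}

// tick runs every job due at the given scheduler second; a real Tinker
// would also capture logs and exit codes per run.
func tick(jobs []Job, sec int64) {
	for _, j := range jobs {
		if sec%int64(j.Interval) == 0 {
			j.Run()
		}
	}
}

func main() {
	fired := 0
	jobs := []Job{{Name: "heartbeat", Interval: 5, Run: func() { fired++ }}}
	for s := int64(1); s <= 30; s++ { // simulate 30 one-second ticks
		tick(jobs, s)
	}
	fmt.Println(fired) // 6 firings in 30 seconds at a 5s interval
}
```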

SqlReview

Initially we built on Inception, but we needed custom rules, so we adopted TiDB's parser and implemented a Go-based review system covering all Inception checks plus extensions such as:

Redundant index detection.

Enum column index validation.

Primary/unique index restrictions in composite indexes.

Mandatory auto-increment primary keys.

Index count limits.

Column ordering rules.

Index bloat prevention.

Varchar length warnings.

Naming conventions.
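One of those extensions, redundant index detection, reduces to a strict leftmost-prefix comparison between indexes on the same table. A sketch over an assumed metadata shape (the real system works from the parsed DDL):

```go
package main

import (
	"fmt"
	"sort"
)

// isPrefix reports whether a is a strict leftmost prefix of b.
func isPrefix(a, b []string) bool {
	if len(a) >= len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// redundantIndexes flags any index whose column list is a strict leftmost
// prefix of another index on the same table: queries served by the shorter
// index are already served by the longer one.
func redundantIndexes(indexes map[string][]string) []string {
	var out []string
	for name, cols := range indexes {
		for other, ocols := range indexes {
			if name != other && isPrefix(cols, ocols) {
				out = append(out, name)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := map[string][]string{
		"idx_city":        {"city"},
		"idx_city_street": {"city", "street"},
		"idx_uid":         {"user_id"},
	}
	fmt.Println(redundantIndexes(idx)) // [idx_city]
}
```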

Risk Interception & Release Controls

Deletion of indexes/columns evaluated against metadata usage.

Prohibited column drops.

Modify operations checked for data loss (e.g., TEXT→VARCHAR).

Cross‑database operations blocked.

All TRUNCATE/DROP statements forbidden.
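The gate logic for these release controls can be sketched with simple keyword checks. The actual system parses statements into an AST with TiDB's parser, so this is only an illustration of the interception flow, not how the rules are really matched:

```go
package main

import (
	"fmt"
	"strings"
)

// reviewStatement applies two of the release-control rules above to one SQL
// text: forbid TRUNCATE/DROP statements, and prohibit column drops.
func reviewStatement(sql string) error {
	upper := strings.ToUpper(strings.TrimSpace(sql))
	switch {
	case strings.HasPrefix(upper, "TRUNCATE"), strings.HasPrefix(upper, "DROP"):
		return fmt.Errorf("blocked: TRUNCATE/DROP statements are forbidden")
	case strings.Contains(upper, "DROP COLUMN"):
		return fmt.Errorf("blocked: column drops are prohibited")
	}
	return nil // passed this gate; further checks would follow
}

func main() {
	for _, s := range []string{
		"DROP TABLE orders",
		"ALTER TABLE users DROP COLUMN email",
		"ALTER TABLE users ADD COLUMN email VARCHAR(64)",
	} {
		fmt.Println(s, "->", reviewStatement(s))
	}
}
```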

Built‑in Compliance Checks

Large‑field usage constraints.

Database, table, and index naming standards.

Multi‑active required field and attribute verification.

Historical Validation Data

Aggregated audit results reveal which teams commit the most violations, enabling data‑driven improvement without reliance on manual training.

Multi‑Active Release System

Our multi‑active architecture uses DRC for cross‑data‑center synchronization, but DRC does not support DDL. We therefore employ gh‑ost (Go implementation) for online schema changes, avoiding trigger‑based lock contention.

Workflow: create temporary table → apply DDL on it → register binlog listeners → apply events to temp table → copy data from original table → coordinated cut‑over after all gh‑ost instances finish copying.

We added a coordinator to synchronize cut‑over across data centers, ensuring sub‑second inter‑region latency and preventing DRC from ingesting gh‑ost‑generated binlogs.

Automation now enables developers to self-service 95% of release requests without DBA intervention, marking a shift toward AIOps-driven DBA operations.

Note: This article is compiled from Cai Peng’s presentation at the 728 Database Salon.
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
