
How ZanDB Automates MySQL Operations at Scale: A Deep Dive

ZanDB is Youzan's comprehensive MySQL automation platform that standardizes OS and database configurations, introduces a web‑based UI, task scheduling, backup monitoring, host and instance management, log analysis, metadata services, and high‑availability features to dramatically reduce manual DBA work and improve reliability.

Efficient Ops

1. Introduction

Youzan, a leading SaaS provider for new retail, has grown from dozens of merchants to three million across retail, beauty, catering, and media. That growth brought explosive traffic and a sharp increase in the number of servers and database instances.

This surge created challenges such as rapid instance provisioning, slow‑query optimization, backup and recovery management, and the inefficiency of using Excel as a CMDB.

This article presents ZanDB, Youzan's in-house database automation platform, which was designed to address these challenges.

2. Automation Preparation

2.1 Standardization

Standardization is the foundation for scaling operations. Youzan defined OS-level standards (RAID 5 disk arrays, write-back (WB) cache on the RAID controller, the deadline I/O scheduler, SSD optimizations) and database-level standards (a uniform directory layout, one configuration file per instance, consistent MySQL versions, and unified parameters).

These standards were applied over two months using SaltStack to enforce software installation and file configuration.

2.2 ZanDB Technology Stack

ZanDB is built with Python Django, Percona‑Toolkit, a custom agent (servant), Celery, and a front‑end based on jQuery and Ajax. Redis is used for caching and MySQL for persistent storage.

3. Phase 1 – Backup Monitoring

Data backup is critical. The initial version replaced ad‑hoc shell scripts with a centralized backup monitoring system that provides real‑time status, execution duration, and five‑day statistics, enabling DBAs to quickly detect failures and trigger alerts.
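A five-day summary of this kind could be computed roughly as follows. This is a minimal sketch: the `BackupRecord` fields and the `summarize` and `failing_instances` helpers are hypothetical names chosen for illustration, not ZanDB's actual schema or code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupRecord:
    # Hypothetical record shape; the article does not describe the real schema.
    instance: str
    started_at: datetime
    duration_s: int
    succeeded: bool


def summarize(records, now, days=5):
    """Per-instance run count, failure count and longest duration over the last `days` days."""
    cutoff = now - timedelta(days=days)
    summary = {}
    for r in records:
        if r.started_at < cutoff:
            continue  # outside the reporting window
        s = summary.setdefault(r.instance, {"total": 0, "failed": 0, "max_duration_s": 0})
        s["total"] += 1
        s["failed"] += 0 if r.succeeded else 1
        s["max_duration_s"] = max(s["max_duration_s"], r.duration_s)
    return summary


def failing_instances(summary):
    """Instances with at least one failed backup in the window; candidates for an alert."""
    return sorted(name for name, s in summary.items() if s["failed"] > 0)
```

Keeping the longest duration next to the failure count lets a dashboard show backups that still succeed but are getting slower.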

4. Phase 2 – Full‑Feature Automation

ZanDB adopts a browser/server (B/S) architecture, with a Go-based agent (servant) running on each database server. The system is divided into seven modules: metadata management, backup management, instance management, host management, task management, log management, and daily maintenance.

4.1 Task System

The task scheduler coordinates backup, metadata collection, instance provisioning, and other operations. It supports time‑based (minute, hour, day, week, month) and interval‑based recurring tasks, eliminating crontab scripts and allowing dynamic adjustments.
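The difference between the two schedule types can be sketched as below. The article lists Celery in the stack but does not describe the scheduler's internals, so the function names here are hypothetical and the logic is only an illustration of how next-run times might be derived.

```python
from datetime import datetime, timedelta


def next_interval_run(last_run, every_seconds):
    """Interval-based task: run again a fixed number of seconds after the last run."""
    return last_run + timedelta(seconds=every_seconds)


def next_daily_run(now, hour, minute):
    """Time-based task: next occurrence of hour:minute strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


def due_tasks(tasks, now):
    """Names of tasks whose stored next-run time has arrived."""
    return [name for name, next_run in tasks.items() if next_run <= now]
```

Because the next-run time is data rather than a crontab line, changing a schedule means updating a stored value instead of editing files on each host.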

4.2 Backup Subsystem

Backups are taken with Percona XtraBackup, compressed, and copied to remote storage with rsync. The backup scripts were rewritten in Python, report their status through API callbacks, and send alerts on failure. Because they are driven by the task system, they no longer depend on crontab.
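The status-callback idea could look something like the sketch below. It is an assumption-laden illustration: `run_backup`, the `steps` list and the `report` callback are hypothetical stand-ins for the real XtraBackup, compression and rsync invocations and for ZanDB's status API, none of which the article shows.

```python
import time


def run_backup(steps, report):
    """Run named steps in order, report after each one, and stop at the first failure.

    `steps` is a list of (name, callable) pairs; `report` is a callback taking a dict.
    Both are hypothetical placeholders for the real commands and the status API.
    """
    started = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # A failed step is reported immediately so an alert can be raised.
            report({"step": name, "status": "failed", "error": str(exc)})
            return False
        report({"step": name, "status": "ok"})
    report({"step": "all", "status": "done",
            "duration_s": round(time.monotonic() - started, 3)})
    return True
```

In practice `report` would post to the platform over HTTP; passing `list.append` instead is enough to exercise the flow locally.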

4.3 Host Management

Host metadata (IP, location, memory, disk) is refreshed via Zabbix/Open‑Falcon APIs, enabling capacity planning and proactive alerts for low‑space situations.
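A low-space check over the collected host metadata might be as simple as the following sketch. The dictionary keys and the 20% default threshold are assumptions made for illustration; the article does not specify field names or alert thresholds.

```python
def low_space_hosts(hosts, min_free_ratio=0.2):
    """Return (ip, free_ratio) for hosts below the free-space threshold, lowest first."""
    flagged = []
    for h in hosts:
        # Field names are hypothetical; real values would come from the monitoring API.
        free_ratio = h["disk_free_gb"] / h["disk_total_gb"]
        if free_ratio < min_free_ratio:
            flagged.append((h["ip"], round(free_ratio, 3)))
    return sorted(flagged, key=lambda item: item[1])
```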

4.4 Instance Management

The instance module supports multiple instances per host and provides instance listing, creation of master-slave pairs, schema splitting, daily consistency checks, and snapshots of instance metrics for historical analysis.

4.5 Log Management

The log module collects slow-query logs and killed-SQL logs, provides Top-N displays, and triggers alerts when thresholds are exceeded. Logs are parsed with pt-query-digest and presented together with execution plans and table statistics.
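A Top-N view over parsed slow-query entries could be built along these lines. The sketch assumes the log has already been reduced to `(fingerprint, seconds)` pairs, which in ZanDB's case is the job of pt-query-digest; the helper names and the ranking by total time are illustrative choices, not the platform's documented behavior.

```python
from collections import defaultdict


def top_n_slow(entries, n=3):
    """Aggregate (fingerprint, seconds) pairs; return the n fingerprints with the most total time."""
    totals = defaultdict(lambda: {"count": 0, "total_s": 0.0})
    for fingerprint, seconds in entries:
        totals[fingerprint]["count"] += 1
        totals[fingerprint]["total_s"] += seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["total_s"], reverse=True)
    return ranked[:n]


def over_threshold(entries, max_count):
    """True when the number of slow queries in the window exceeds the alert threshold."""
    return len(entries) > max_count
```

Ranking by total time rather than by count surfaces a rare but very expensive query ahead of a frequent cheap one; either ordering is a reasonable choice.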

4.6 Metadata Management

The metadata module manages binlog metadata, runs primary-key overflow checks, and offers a shard-lookup service that quickly identifies which instance holds a given database or table.
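A primary-key overflow check compares each table's next auto-increment value with the maximum its column type can hold. The sketch below uses MySQL's documented integer ranges; the tuple layout, the 80% threshold and the function names are assumptions, and in a real system the inputs would most likely be read from `information_schema`.

```python
# Maximum values of MySQL integer types, keyed by (type, unsigned).
INT_MAX = {
    ("tinyint", False): 2**7 - 1,
    ("tinyint", True): 2**8 - 1,
    ("smallint", False): 2**15 - 1,
    ("smallint", True): 2**16 - 1,
    ("mediumint", False): 2**23 - 1,
    ("mediumint", True): 2**24 - 1,
    ("int", False): 2**31 - 1,
    ("int", True): 2**32 - 1,
    ("bigint", False): 2**63 - 1,
    ("bigint", True): 2**64 - 1,
}


def pk_usage(column_type, unsigned, next_auto_increment):
    """Fraction of the key space already consumed by the auto-increment counter."""
    return next_auto_increment / INT_MAX[(column_type, unsigned)]


def at_risk(tables, threshold=0.8):
    """Names of tables whose key usage is at or above `threshold`.

    `tables` holds hypothetical (name, column_type, unsigned, next_auto_increment) tuples.
    """
    return [name for name, ctype, unsigned, nxt in tables
            if pk_usage(ctype, unsigned, nxt) >= threshold]
```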

4.7 Daily Maintenance

This module automates low-frequency, high-cost manual tasks such as batch parameter queries, batch configuration changes, emergency binlog recovery, and SQL execution (DML prohibited).

4.8 Data Operations

Aggregated instance metrics feed trend charts for space and memory utilization and cost‑allocation dashboards to aid resource planning.

4.9 High‑Availability Management

The first-generation HA setup used keepalived with a virtual IP (VIP), which suffered from disk I/O jitter and ARP limits. The second generation uses a Go-based HA manager (hamster) that performs cluster health checks, handles active/passive failover based on relay logs or GTID, and sits behind a proxy layer. This removes the VIP-related issues and supports dual-datacenter disaster recovery.
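One common way to keep brief jitter from triggering a failover is to require several consecutive failed health checks before declaring a node down. The article does not describe hamster's internals (and hamster itself is written in Go), so the class below is only a hypothetical Python illustration of that general technique, not its actual logic.

```python
class FailureDetector:
    """Declare a node down only after `threshold` consecutive failed health checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, check_ok):
        """Record one health-check result; return True once the node should be treated as down."""
        # Any successful check resets the counter, so isolated blips never accumulate.
        self.failures = 0 if check_ok else self.failures + 1
        return self.failures >= self.threshold
```

A higher threshold tolerates more jitter at the cost of slower detection; the right value depends on the check interval and how much downtime is acceptable.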

5. Outlook

ZanDB currently automates about 70% of manual DBA work. Future goals include sub-second monitoring, log auditing, instance inspection, horizontal scaling, performance diagnostics, and automated slow-query analysis to further increase developer productivity.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
