
Automating TiDB Operations at Zhuanzhuan: From Manual Management to Platform‑Based Automation

This article details Zhuanzhuan's journey of automating TiDB operations: the initial operational pain points, metadata and machine resource management, a full cluster upgrade, an alarm redesign, and a work‑order‑driven platform that streamlines node management, scaling, decommissioning, and monitoring while reducing manual effort and cost.

Zhuanzhuan Tech

Zhuanzhuan, one of the earliest users of TiDB, shares how it moved from manual DBA operations to an automated platform. The goal is to turn every operational request into a work order and every action into a platform operation, reducing labor costs.

The operational pain points included fragmented management across 30+ clusters with roughly 500 TiKV nodes, reliance on Ansible scripts that often timed out, difficulty locating nodes during batch operations, frequent manual migrations, resource imbalance across machines, noisy duplicate alerts, and limited visibility into cluster status and backups.

Solutions were introduced in four areas:

Metadata Management – all node and component information is stored in a centralized table, regularly refreshed to provide a global view for resource scheduling and capacity alerts.

Machine Resource Management – hardware usage is collected per machine, enabling automated rebalancing and more efficient resource allocation.

Comprehensive Upgrade – all clusters running TiDB 2.1 were upgraded to 4.0.13. Ports now follow a standardized component‑plus‑cluster‑ID naming scheme, and services have dedicated domain names, which improves manageability. Performance improved by 30‑50%.

Alarm Redesign – alerts now support SMS, voice, convergence, suppression, and escalation across media, recipients, and time windows, cutting alert volume by at least 60%.
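
As a rough illustration of alert convergence, the sketch below collapses repeated alerts for the same cluster and rule that arrive inside a time window. The `Alert` fields, the grouping key, and the 300-second window are assumptions made for the example; the article does not describe how the real platform groups or times its alerts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    cluster: str      # hypothetical field names, chosen for this sketch
    rule: str
    instance: str
    timestamp: float  # seconds since epoch


def converge(alerts: Iterable[Alert], window_seconds: float = 300.0) -> List[Alert]:
    """Keep one alert per (cluster, rule) per window and drop the repeats."""
    last_sent: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.cluster, alert.rule)
        previous = last_sent.get(key)
        # Send if this key has never fired, or the window since the last send has passed.
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_sent[key] = alert.timestamp
    return kept
```

Grouping by cluster and rule rather than by instance means that many nodes hitting the same rule at once produce a single notification, which is one plausible way to reduce duplicate noise.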

Automation Implementation focused on three pillars:

Work‑order Automation

Requests such as cluster deployment, data recovery, TiCDC, and TiFlash are handled via a ticket system, automating steps like topology generation, component configuration, user creation, and information synchronization.
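
One way to picture a work order is as an ordered list of steps that stops at the first failure and reports how far it got. The sketch below is only that: the step names mirror the ones listed above, but the step functions, the order fields (`tikv_hosts`, `owner`), and the result format are hypothetical and not taken from the actual platform.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], None]]


def run_work_order(order: Dict, steps: List[Step]) -> Dict:
    """Run steps in order; stop at the first failure and report progress."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action(order)
        except Exception as exc:
            return {
                "status": "failed",
                "failed_step": name,
                "error": str(exc),
                "completed": list(completed),
            }
        completed.append(name)
    return {"status": "done", "completed": completed}


# Hypothetical steps; real ones would call deployment and database tooling.
def generate_topology(order: Dict) -> None:
    order["topology"] = {"tikv": list(order["tikv_hosts"])}


def create_user(order: Dict) -> None:
    if not order.get("owner"):
        raise ValueError("work order has no owner")
    order["db_user"] = f"{order['owner']}_rw"


DEPLOY_STEPS: List[Step] = [
    ("generate_topology", generate_topology),
    ("create_user", create_user),
]
```

Recording the completed steps makes a failed order resumable or at least auditable, since an operator can see exactly which actions already took effect.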

Platform‑based Operations

Node management (start, stop, reload, decommission, maintenance) is performed through a web UI with safeguards for quorum requirements and silent periods. Scaling operations support both manual and automatic expansion, with configurable target addresses. Decommissioning enforces serial node removal and data migration completion. Alarm management offers pre‑configured silencing, one‑click silence, and rule‑based alerts for slow queries.
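
The quorum safeguard can be sketched as a pre-check run before a stop request is accepted. This version assumes that quorum means a strict majority of a component's members must remain healthy, as in Raft-based components such as PD; the function name and the `members` mapping are illustrative, and the platform's real checks are not described in the article.

```python
from typing import Dict


def can_stop(members: Dict[str, bool], target: str) -> bool:
    """Return True if stopping `target` still leaves a healthy majority.

    `members` maps node name -> whether the node is currently healthy.
    """
    if target not in members:
        raise KeyError(f"unknown node: {target}")
    quorum = len(members) // 2 + 1
    healthy_after = sum(
        1 for name, healthy in members.items() if healthy and name != target
    )
    return healthy_after >= quorum
```

With three members, stopping one healthy node is allowed when the other two are up, but refused when one member is already down, because only one healthy member would remain.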

Auxiliary Functions

Process‑level monitoring captures CPU, memory, disk I/O, and network usage per instance, aiding troubleshooting. Trend monitoring tracks data growth to trigger capacity warnings. Automatic operations include self‑adaptive migration when memory or disk thresholds are reached and auto‑scaling when a TiKV store exceeds 800 GB.
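
The 800 GB trigger and the growth-trend warning can be sketched as two small checks. The 800 GB figure comes from the text above; the input shapes (a store-to-size mapping and a list of daily size samples) and the simple linear extrapolation are assumptions for illustration, not the platform's actual model.

```python
from typing import Dict, List, Optional

STORE_LIMIT_GB = 800.0  # threshold quoted in the article


def stores_over_limit(store_sizes_gb: Dict[str, float],
                      limit_gb: float = STORE_LIMIT_GB) -> List[str]:
    """Return the stores whose size exceeds the limit, sorted by name."""
    return sorted(s for s, size in store_sizes_gb.items() if size > limit_gb)


def days_until_limit(daily_sizes_gb: List[float],
                     limit_gb: float = STORE_LIMIT_GB) -> Optional[float]:
    """Estimate days until the limit by extending the average daily growth.

    Returns None when there are too few samples or the data is not growing.
    """
    if len(daily_sizes_gb) < 2:
        return None
    growth_per_day = (daily_sizes_gb[-1] - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)
    if growth_per_day <= 0:
        return None
    remaining = max(limit_gb - daily_sizes_gb[-1], 0.0)
    return remaining / growth_per_day
```

The first check would feed an auto-scaling work order; the second gives a lead time, so a capacity warning can fire before the threshold is actually crossed.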

The following table illustrates the standardized port and domain scheme for a sample cluster (ID 001):

| Role | Count | Port | Deploy Path | Domain | Remark |
| --- | --- | --- | --- | --- | --- |
| pd | 3 | 13001/14001 | /path/pd-13001 | tdb.13001.com | dashboard domain |
| tidb | 3 | 15001/16001 | /path/tidb-15001 | tdb.15001.com | external service domain |
| tikv | 3 | 17001/18001 | /path/tikv-17001 | | |
| alertmanager | 1 | 21001 | /path/alertmanager-21001 | tdb.21001.com | alertmanager domain |
| prometheus | 1 | 19001 | /path/prometheus-19001 | tdb.19001.com | prometheus domain |
| grafana | 1 | 20001 | /path/grafana-20001 | tdb.20001.com | grafana domain |
| exporter | n | 11001/12001 | /path/monitor-11001 | | deployed on every machine |
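
The scheme for cluster 001 suggests that each port is a per-component prefix followed by the three-digit cluster ID. The sketch below encodes that reading. The prefix-times-1000 arithmetic, the ID range check, and the set of roles that receive a domain are inferred from this single sample cluster, so treat them as assumptions; how IDs above 999 would be handled is not stated.

```python
from typing import Dict, Optional, Tuple

# Port prefixes read off the sample cluster (ID 001).
PORT_PREFIXES: Dict[str, Tuple[int, ...]] = {
    "exporter": (11, 12),
    "pd": (13, 14),
    "tidb": (15, 16),
    "tikv": (17, 18),
    "prometheus": (19,),
    "grafana": (20,),
    "alertmanager": (21,),
}
# Roles that have a domain in the sample; tikv and exporter do not.
ROLES_WITH_DOMAIN = {"pd", "tidb", "prometheus", "grafana", "alertmanager"}


def layout(role: str, cluster_id: int) -> Dict[str, object]:
    """Derive ports, deploy path, and domain for a role in a given cluster."""
    if not 0 < cluster_id < 1000:
        raise ValueError("this sketch assumes a three-digit cluster ID")
    ports = [prefix * 1000 + cluster_id for prefix in PORT_PREFIXES[role]]
    # The exporter's directory is named "monitor" in the sample.
    dir_name = "monitor" if role == "exporter" else role
    domain: Optional[str] = (
        f"tdb.{ports[0]}.com" if role in ROLES_WITH_DOMAIN else None
    )
    return {
        "ports": ports,
        "deploy_path": f"/path/{dir_name}-{ports[0]}",
        "domain": domain,
    }
```

Because everything derives from the role and the cluster ID, a node's cluster can be identified from its port alone, which helps when locating nodes during batch operations.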

In conclusion, Zhuanzhuan's automation journey transformed its TiDB operations by standardizing metadata, balancing resources, upgrading clusters, redesigning alerts, and building a platform that reduces manual effort, improves reliability, and provides a foundation for future scaling.
