Automating TiDB Operations at Zhuanzhuan: From Manual Management to Platform-Based Automation
This article describes how Zhuanzhuan automated its TiDB operations. It covers the initial operational pain points, the introduction of metadata and machine resource management, a fleet-wide version upgrade, an alarm redesign, and a work-order-driven platform that handles node management, scaling, decommissioning, and monitoring while significantly reducing manual effort and cost.
Zhuanzhuan, one of the earliest users of TiDB, shares how it moved from manual DBA operations to a fully automated platform. The goal is to turn every operational request into a work order and every action into a platform operation, so that labor costs fall.
The operational pain points were: fragmented management across 30+ clusters with roughly 500 TiKV nodes, reliance on Ansible scripts that often timed out, difficulty locating nodes during batch operations, frequent manual migrations, resource imbalance between machines, noisy duplicate alerts, and limited visibility into cluster status and backups.
Solutions were introduced in four areas:
Metadata Management – all node and component information is stored in a centralized table, regularly refreshed to provide a global view for resource scheduling and capacity alerts.
Machine Resource Management – hardware usage is collected per machine, enabling automated rebalancing and more efficient resource allocation.
Comprehensive Upgrade – all 2.1 clusters were upgraded to 4.0.13. Ports now follow a standard scheme that combines a component prefix with the cluster ID, and each service has a dedicated domain name. The result is easier management and a 30-50% performance boost.
Alarm Redesign – alerts now support SMS, voice, convergence, suppression, and escalation across media, recipients, and time windows, cutting alert volume by at least 60%.
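Alert convergence, one of the features named above, can be illustrated with a minimal sketch. The article does not describe how the platform implements it, so the `Alert` fields, the `(cluster, rule)` grouping key, and the 300-second window below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    cluster: str      # hypothetical field: cluster identifier
    rule: str         # hypothetical field: name of the alert rule that fired
    timestamp: float  # seconds since some epoch


def converge(alerts, window_seconds=300):
    """Collapse alerts sharing (cluster, rule) that fire within
    window_seconds of the first alert in their group."""
    groups = []       # converged notifications, in order of first occurrence
    open_groups = {}  # (cluster, rule) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.cluster, alert.rule)
        group = open_groups.get(key)
        if group is not None and alert.timestamp - group["first_seen"] <= window_seconds:
            # Duplicate inside the window: count it instead of notifying again.
            group["count"] += 1
            group["last_seen"] = alert.timestamp
        else:
            group = {
                "cluster": alert.cluster,
                "rule": alert.rule,
                "first_seen": alert.timestamp,
                "last_seen": alert.timestamp,
                "count": 1,
            }
            open_groups[key] = group
            groups.append(group)
    return groups
```

With this grouping, a burst of identical alerts from one cluster produces a single notification carrying a count, which is one plausible way a reduction in alert volume like the one reported could come about.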
The automation implementation focused on three pillars:
Work‑order Automation
Requests for cluster deployment, data recovery, TiCDC, and TiFlash are handled through a ticket system, which automates steps such as topology generation, component configuration, user creation, and information synchronization.
Platform‑based Operations
Node management (start, stop, reload, decommission, maintenance) is performed through a web UI, with safeguards that check quorum requirements and apply alert silence periods. Scaling supports both manual and automatic expansion, with configurable target addresses. Decommissioning removes nodes one at a time and requires data migration to finish before the next node is removed. Alarm management offers pre-configured silencing, one-click silence, and rule-based alerts for slow queries.
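The quorum safeguard can be pictured with a small check. This is a sketch under assumptions: the article does not say how the platform evaluates quorum, so the code applies the standard Raft majority rule (more than half of the members must stay up) to a group such as the PD members, and the function names are hypothetical.

```python
def quorum_size(total_members: int) -> int:
    """Smallest number of members that forms a majority."""
    if total_members < 1:
        raise ValueError("total_members must be positive")
    return total_members // 2 + 1


def can_stop(total_members: int, healthy_members: int, to_stop: int = 1) -> bool:
    """Return True if stopping `to_stop` healthy members still leaves a majority."""
    if not 0 <= to_stop <= healthy_members <= total_members:
        raise ValueError("expected 0 <= to_stop <= healthy_members <= total_members")
    return healthy_members - to_stop >= quorum_size(total_members)
```

For a three-member group with all members healthy, stopping one node passes the check; if one member is already down, the same request is rejected because only one of three members would remain.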
Auxiliary Functions
Process‑level monitoring captures CPU, memory, disk I/O, and network usage per instance, aiding troubleshooting. Trend monitoring tracks data growth to trigger capacity warnings. Automatic operations include self‑adaptive migration when memory or disk thresholds are reached and auto‑scaling when a TiKV store exceeds 800 GB.
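The 800 GB trigger and the growth-trend warning can be sketched as follows. Only the 800 GB figure comes from the article. The store names, the one-sample-per-day input, and the linear projection are assumptions, since the platform's actual trend model is not described.

```python
TIKV_STORE_LIMIT_GB = 800  # threshold quoted in the article


def stores_over_limit(store_sizes_gb, limit_gb=TIKV_STORE_LIMIT_GB):
    """Return the names of stores whose size exceeds the limit, sorted by name."""
    return sorted(name for name, size in store_sizes_gb.items() if size > limit_gb)


def days_until_limit(daily_sizes_gb, limit_gb=TIKV_STORE_LIMIT_GB):
    """Project days until the limit is reached, assuming linear growth.

    `daily_sizes_gb` holds one size sample per day, oldest first.
    Returns 0.0 if the limit is already reached, or None if the
    data is not growing.
    """
    if len(daily_sizes_gb) < 2:
        raise ValueError("need at least two daily samples")
    latest = daily_sizes_gb[-1]
    if latest >= limit_gb:
        return 0.0
    # Average growth per day across the sampled period.
    growth_per_day = (latest - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)
    if growth_per_day <= 0:
        return None
    return (limit_gb - latest) / growth_per_day
```

A scheduler could call `stores_over_limit` to decide when to open a scale-out work order and use `days_until_limit` to raise a capacity warning ahead of time.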
The following table illustrates the standardized port and domain scheme for a sample cluster (ID 001):
| Role | Count | Port | Deploy Path | Domain | Remark |
| --- | --- | --- | --- | --- | --- |
| pd | 3 | 13001/14001 | /path/pd-13001 | tdb.13001.com | dashboard domain |
| tidb | 3 | 15001/16001 | /path/tidb-15001 | tdb.15001.com | external service domain |
| tikv | 3 | 17001/18001 | /path/tikv-17001 | | |
| alertmanager | 1 | 21001 | /path/alertmanager-21001 | tdb.21001.com | alertmanager domain |
| prometheus | 1 | 19001 | /path/prometheus-19001 | tdb.19001.com | prometheus domain |
| grafana | 1 | 20001 | /path/grafana-20001 | tdb.20001.com | grafana domain |
| exporter | n | 11001/12001 | /path/monitor-11001 | | deployed on every machine |
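The pattern in the table can be expressed as a short helper. This is a sketch inferred from the single sample cluster (ID 001): the formula "two-digit role prefix followed by a three-digit cluster ID" is an assumption, and the function names are hypothetical. The `/path` base and the `tdb.<port>.com` domain form are copied from the table as written.

```python
# Two-digit port prefixes per role, read off the sample table.
ROLE_PREFIXES = {
    "exporter": (11, 12),
    "pd": (13, 14),
    "tidb": (15, 16),
    "tikv": (17, 18),
    "prometheus": (19,),
    "grafana": (20,),
    "alertmanager": (21,),
}

# Roles that have a dedicated domain in the table.
ROLES_WITH_DOMAIN = {"pd", "tidb", "prometheus", "grafana", "alertmanager"}


def ports_for(role: str, cluster_id: int) -> tuple:
    """Ports for a role: each prefix followed by the three-digit cluster ID (assumed)."""
    if not 1 <= cluster_id <= 999:
        raise ValueError("cluster_id must fit in three digits")
    return tuple(prefix * 1000 + cluster_id for prefix in ROLE_PREFIXES[role])


def deploy_path(role: str, cluster_id: int, base: str = "/path") -> str:
    """Deploy directory named after the role and its first port."""
    # The table uses the directory name "monitor" for the exporter role.
    name = "monitor" if role == "exporter" else role
    return f"{base}/{name}-{ports_for(role, cluster_id)[0]}"


def domain_for(role: str, cluster_id: int):
    """Domain built from the role's first port, or None if the role has no domain."""
    if role not in ROLES_WITH_DOMAIN:
        return None
    return f"tdb.{ports_for(role, cluster_id)[0]}.com"
```

Deriving every value from the role and cluster ID means a port number alone is enough to identify both the component and the cluster it belongs to.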
In conclusion, Zhuanzhuan's automation effort transformed its TiDB operations by standardizing metadata, balancing machine resources, upgrading clusters, redesigning alerts, and building a platform that reduces manual effort, improves reliability, and lays a foundation for future growth.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.