
Automating TiDB Operations: From Manual Pain Points to a Scalable Platform

This article describes how Zhaozhuan's DBA team reworked TiDB cluster management. The team addressed metadata, resource‑allocation, upgrade, and alerting problems with an automation platform that covers work orders, node operations, scaling, monitoring, and alert handling, which reduced manual effort and improved reliability.

ITPUB

Operational Challenges

More than 30 TiDB clusters (~500 TiKV nodes) required manual machine selection for new clusters, scaling, and migration, causing duplicated effort and low efficiency.

Operations on TiDB 2.1 were driven by Ansible; time‑outs and lack of visibility made batch actions error‑prone.

TiDB 2.1 suffered from plan invalidation, hotspot queries, OOM on large scans, optimistic‑transaction bugs, incomplete monitoring, and cumbersome backup procedures.

Resource imbalance: some hosts had excess memory but insufficient disk space and vice‑versa, leading to perceived over‑provisioning.

High alarm noise with duplicate alerts reduced troubleshooting effectiveness.

Technical Solutions

Metadata Management

All node and component information is stored in a central relational table. A scheduled collector refreshes node status, resource usage, and component health, providing a global view for capacity planning. Example: a TiKV capacity limit of 500 GB triggers an alert for expansion, preventing uncontrolled disk growth.
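
The capacity check in the example above might look like the following minimal sketch. The `TiKVNode` row shape, the field names, and the sample addresses are assumptions made for illustration; the article does not publish the actual table schema or collector code.

```python
from dataclasses import dataclass

# Hypothetical row shape; the real metadata table schema is not given in the article.
@dataclass
class TiKVNode:
    cluster: str
    address: str
    used_gb: float

CAPACITY_LIMIT_GB = 500  # the example limit quoted in the text

def nodes_needing_expansion(nodes, limit_gb=CAPACITY_LIMIT_GB):
    """Return the nodes whose disk usage has reached the capacity limit."""
    return [n for n in nodes if n.used_gb >= limit_gb]

# Sample rows as a scheduled collector might have refreshed them (made-up values).
nodes = [
    TiKVNode("cluster001", "10.0.0.1:17001", 420.0),
    TiKVNode("cluster001", "10.0.0.2:17001", 512.5),
]
flagged = nodes_needing_expansion(nodes)
print([n.address for n in flagged])  # prints ['10.0.0.2:17001']
```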

Machine‑Level Resource Management

Hardware metrics (CPU, memory, disk I/O) are persisted per host. The data drives automated rebalancing and scheduling, improving overall utilization and reducing wasted machine resources by roughly 15 %.

Metadata and resource tables are the foundation for health checks and monitoring.
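
As a rough illustration of how a per-host resource table can drive machine selection, the sketch below picks a host that satisfies both the memory and the disk requirement of a new instance. The `Host` fields and the "largest headroom on the scarcer resource" rule are assumptions for this example, not the team's documented scheduling policy.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical per-host record; field names are illustrative only.
@dataclass
class Host:
    name: str
    free_mem_gb: float
    free_disk_gb: float

def pick_host(hosts: List[Host], need_mem_gb: float, need_disk_gb: float) -> Optional[Host]:
    """Pick a host that fits both requirements (both must be > 0).

    Among the hosts that fit, prefer the one with the most headroom on its
    scarcer resource, so memory-rich but disk-poor hosts (and the reverse)
    are not chosen when a balanced host is available.
    """
    fits = [h for h in hosts
            if h.free_mem_gb >= need_mem_gb and h.free_disk_gb >= need_disk_gb]
    if not fits:
        return None
    return max(fits, key=lambda h: min(h.free_mem_gb / need_mem_gb,
                                       h.free_disk_gb / need_disk_gb))

hosts = [
    Host("a", free_mem_gb=64, free_disk_gb=300),    # plenty of memory, little disk
    Host("b", free_mem_gb=16, free_disk_gb=2000),   # plenty of disk, little memory
    Host("c", free_mem_gb=48, free_disk_gb=1200),
]
chosen = pick_host(hosts, need_mem_gb=32, need_disk_gb=500)
print(chosen.name if chosen else None)  # prints c
```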

Full Upgrade to TiDB 4.0.13

All TiDB 2.1 clusters were upgraded to version 4.0.13. Prior to upgrade a port‑naming convention was introduced: {component‑code}{three‑digit‑cluster‑ID}. For cluster 001 the ports are:

PD: 13001/14001

TiDB: 15001/16001

TiKV: 17001/18001

Alertmanager: 21001

Prometheus: 19001

Grafana: 20001

Exporters: 11001/12001
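
The convention can be expressed as a small helper. This is a sketch only: the two-digit component codes are read off the cluster 001 port list above, while the `ports_for` name and the accepted ID range of 1 to 999 are assumptions.

```python
# Two-digit component codes taken from the cluster 001 port list.
COMPONENT_CODES = {
    "exporter": (11, 12),
    "pd": (13, 14),
    "tidb": (15, 16),
    "tikv": (17, 18),
    "prometheus": (19,),
    "grafana": (20,),
    "alertmanager": (21,),
}

def ports_for(component: str, cluster_id: int) -> list:
    """Build ports as {component-code}{three-digit-cluster-ID}."""
    if not 1 <= cluster_id <= 999:  # assumed valid range for a three-digit ID
        raise ValueError("cluster_id must fit in three digits")
    return [code * 1000 + cluster_id for code in COMPONENT_CODES[component]]

print(ports_for("tikv", 1))      # prints [17001, 18001]
print(ports_for("grafana", 42))  # prints [20042]
```

Because the cluster ID is embedded in every port, a port number seen in a log or a firewall rule identifies both the component and the cluster it belongs to.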

Each cluster runs independent monitoring components to avoid cross‑cluster alert contamination, and five domain names map to TiDB services, Dashboard, Grafana, Prometheus, and Alertmanager for quick lookup.

Post‑upgrade performance increased by 30‑50 % and data‑extraction‑related outages were eliminated.

Alarm Refactoring

Alerts are delivered over multiple channels (including SMS and voice) with convergence, suppression, and escalation. Noise was reduced by at least 60 % through a channel escalation chain (email → WeChat → SMS → phone), a three‑level recipient hierarchy (Level 1 → Level 2 → Leader), and time‑based escalation rules (working hours vs. off‑hours).
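
One way to model the escalation is to map the time an alert has gone unacknowledged to a step in each chain. The sketch below is an assumption-laden illustration: the 10-minute step, the coupling of channel and recipient escalation to the same step counter, and the function name are all hypothetical, and working-hours handling is left out.

```python
CHANNELS = ["email", "WeChat", "SMS", "phone"]   # channel chain from the text
RECIPIENTS = ["Level 1", "Level 2", "Leader"]    # recipient hierarchy from the text

def escalation_target(unacked_minutes, step_minutes=10):
    """Return (channel, recipient) for an alert left unacknowledged.

    Each elapsed step moves one position along both chains; the final
    position of each chain is sticky. step_minutes is a made-up default.
    """
    step = int(max(0, unacked_minutes) // step_minutes)
    channel = CHANNELS[min(step, len(CHANNELS) - 1)]
    recipient = RECIPIENTS[min(step, len(RECIPIENTS) - 1)]
    return channel, recipient

print(escalation_target(0))    # prints ('email', 'Level 1')
print(escalation_target(25))   # prints ('SMS', 'Leader')
print(escalation_target(100))  # prints ('phone', 'Leader')
```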

Automation Platform

Work‑Order System

Daily DBA tasks are encapsulated as work orders:

Cluster Deployment: Generates topology files, selects machines per role, creates business users and domains, and registers metadata.

Data Recovery: Supports fast snapshot‑based recovery within the GC window and backup‑file recovery for older points; options include whole‑database, single‑table, or multi‑table restores.

TiCDC Extraction: Preferred path for feeding business data to a big‑data platform; TiFlash is used as a fallback. Configuration updates are handled via work orders.

TiFlash Extraction: Similar to TiCDC but incurs additional storage cost.

Platform‑Based Operations

All routine actions are performed through a web UI, providing audit trails and reducing human error.

Node Management: Start, stop, reload (instead of restart), decommission, and maintenance actions with safety checks (e.g., raft quorum requirements, forced decommission for dead nodes).

Scaling: Manual and automatic role expansion; if no address is specified, the system auto‑assigns a suitable target.

Decommission: Serial node removal ensures data migration and prune steps complete before proceeding; a forced option handles unrecoverable failures.

Alarm Management: Configure silent periods (max 24 h, default 2 h), mute individual alerts, bulk mute, and define escalation rules.

Slow‑Query Alerts: Trigger when query latency exceeds user‑defined thresholds; cluster‑level thresholds are automatically adjusted.

Additional Helper Functions

Process Monitoring: Captures per‑process CPU, memory, disk I/O, and network usage to aid root‑cause analysis.

Trend Monitoring: Tracks data growth; sends alerts when daily increase > 20 GB for three consecutive days or monthly increase > 200 GB, and when total size crosses defined thresholds.

Automatic Operations: Self‑adaptive migration when memory or disk thresholds are hit; auto‑scaling when a TiKV reaches 800 GB.

Automatic migration and scaling reduce DBA workload, lower overall alarm volume, and improve system performance.

Implementation Details

Work‑Order Types

Supported work‑order categories include:

Cluster deployment – selects appropriate machines for each component (PD, TiDB, TiKV, etc.), creates topology files, initializes the cluster, and registers metadata.

Data recovery – snapshot‑based (fast, within GC window) and backup‑file based (for points outside GC); supports whole‑database, single‑table, or multi‑table restores with optional query‑condition filters.

TiCDC/TiFlash extraction – creates or updates CDC tasks; for TiCDC, ensure the size of messages sent to downstream Kafka stays within Kafka's max‑message‑bytes limit to avoid errors.
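
For the data-recovery work order, the choice between the two restore paths comes down to whether the requested point in time is still inside the GC window. The following sketch shows that decision only. The function name and the idea of passing the window length as a `timedelta` are assumptions; a production check would more likely compare the target against the cluster's actual GC safe point.

```python
from datetime import datetime, timedelta

def recovery_method(target_time: datetime, now: datetime, gc_window: timedelta) -> str:
    """Choose 'snapshot' while the target is inside the GC window, else 'backup'."""
    if target_time > now:
        raise ValueError("cannot recover to a point in the future")
    return "snapshot" if now - target_time <= gc_window else "backup"

now = datetime(2021, 6, 1, 12, 0)
window = timedelta(hours=24)  # illustrative GC window length
print(recovery_method(now - timedelta(hours=2), now, window))  # prints snapshot
print(recovery_method(now - timedelta(days=3), now, window))   # prints backup
```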

Node Operations

Node actions respect component constraints (e.g., PD and TiKV require a minimum of two instances). Decommission checks verify raft quorum and prevent removal of the last node of a role without DBA intervention.
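
A count-based version of these guards could look like the sketch below. It assumes the "minimum of two instances" rule applies to the number of instances left after removal, and it models DBA intervention as a simple `dba_override` flag. It does not inspect raft state, so it is not a substitute for a real quorum check.

```python
# Minimum instance counts quoted in the text; other roles default to 1.
MIN_INSTANCES = {"pd": 2, "tikv": 2}

def can_decommission(role: str, running_instances: int, dba_override: bool = False) -> bool:
    """Decide whether one instance of a role may be removed."""
    remaining = running_instances - 1
    if remaining < 1:
        # Last node of a role: allowed only with explicit DBA intervention.
        return dba_override
    return remaining >= MIN_INSTANCES.get(role, 1)

print(can_decommission("tikv", 3))                     # prints True
print(can_decommission("tikv", 2))                     # prints False
print(can_decommission("tidb", 1))                     # prints False
print(can_decommission("tidb", 1, dba_override=True))  # prints True
```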

Scaling and Decommission

Scaling can target multiple roles simultaneously; unspecified addresses are auto‑assigned. Decommission proceeds serially, waiting for data migration and prune before removing the next node. A forced decommission option handles unrecoverable host failures.
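
The serial behaviour can be sketched as a loop that finishes one node completely before touching the next. The callbacks (`remove_node`, `migration_done`, `prune`, `wait`) are hypothetical placeholders for whatever the platform actually invokes; the demo below uses in-memory fakes rather than real cluster commands.

```python
def decommission_serially(nodes, remove_node, migration_done, prune, wait=lambda: None):
    """Remove nodes one at a time, waiting for migration and prune on each."""
    removed = []
    for node in nodes:
        remove_node(node)                 # start taking the node offline
        while not migration_done(node):   # poll until its data has moved away
            wait()
        prune(node)                       # clean up before moving on
        removed.append(node)
    return removed

# Demo with fakes: each node needs a fixed number of polls to finish migrating.
log = []
pending = {"tikv-1": 2, "tikv-2": 1}

def fake_migration_done(node):
    pending[node] -= 1
    return pending[node] <= 0

result = decommission_serially(
    ["tikv-1", "tikv-2"],
    remove_node=lambda n: log.append(("remove", n)),
    migration_done=fake_migration_done,
    prune=lambda n: log.append(("prune", n)),
)
print(log)
```

In the demo, the log shows tikv-1 removed and pruned before tikv-2 is touched, which is the property the serial design is meant to guarantee.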

Alarm Management UI

Users can mute individual alerts, apply bulk mute, or set silent windows (default 2 h, max 24 h). Alarm lists display active alerts for selective silencing; a one‑click “mute all” button is provided.
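
The default and maximum silent windows translate into a small amount of validation logic. In this sketch, requests longer than the maximum are clamped to 24 h; whether the real UI clamps or rejects such requests is not stated in the article, so that behaviour and the function name are assumptions.

```python
from datetime import datetime, timedelta

DEFAULT_SILENCE = timedelta(hours=2)   # default from the text
MAX_SILENCE = timedelta(hours=24)      # maximum from the text

def silence_until(start: datetime, requested: timedelta = None) -> datetime:
    """Return the time at which a silence starting at `start` expires."""
    duration = DEFAULT_SILENCE if requested is None else requested
    if duration <= timedelta(0):
        raise ValueError("silence duration must be positive")
    return start + min(duration, MAX_SILENCE)  # assumption: clamp, do not reject

start = datetime(2021, 6, 1, 9, 0)
print(silence_until(start))                       # prints 2021-06-01 11:00:00
print(silence_until(start, timedelta(hours=48)))  # prints 2021-06-02 09:00:00
```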

Slow‑Query Alert Configuration

Users define a time window and a query‑latency threshold. If the cluster’s global slow‑query threshold exceeds the user‑defined minimum, the system lowers the cluster threshold to the user value (e.g., cluster 300 ms, user 200 ms → threshold set to 200 ms).
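
The adjustment rule reduces to taking the lower of the two values, as this minimal sketch shows. The function name is hypothetical, and the code covers only the threshold calculation, not how the platform applies the new value to the cluster.

```python
def effective_cluster_threshold_ms(cluster_ms: int, user_ms: int) -> int:
    """Lower the cluster slow-query threshold when the user's is stricter."""
    return min(cluster_ms, user_ms)

print(effective_cluster_threshold_ms(300, 200))  # prints 200 (example from the text)
print(effective_cluster_threshold_ms(100, 200))  # prints 100 (cluster already stricter)
```

Without this adjustment, queries between 200 ms and 300 ms would never be recorded as slow, so the user's alert could not fire on them.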

Process & Trend Monitoring

Process‑level metrics include CPU, memory, disk I/O, and network traffic (network data is collected only when exceeding a configurable threshold). Trend alerts fire on abnormal growth patterns as described above.
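
The growth rules (daily increase > 20 GB for three consecutive days, or monthly increase > 200 GB) can be evaluated from a series of daily size samples. The sketch below assumes one sample per day in chronological order and treats a month as the last 30 days of growth; both are simplifications introduced for this example.

```python
def trend_alerts(daily_sizes_gb, daily_limit=20, streak=3, monthly_limit=200):
    """Return the growth alerts that fire for a series of daily total sizes.

    daily_sizes_gb: total data size in GB, one sample per day, oldest first.
    """
    alerts = []
    deltas = [b - a for a, b in zip(daily_sizes_gb, daily_sizes_gb[1:])]
    # Daily rule: each of the most recent `streak` days grew by more than the limit.
    if len(deltas) >= streak and all(d > daily_limit for d in deltas[-streak:]):
        alerts.append("daily")
    # Monthly rule: growth across the last 30 days (31 samples) exceeds the limit.
    window = daily_sizes_gb[-31:]
    if len(window) >= 2 and window[-1] - window[0] > monthly_limit:
        alerts.append("monthly")
    return alerts

print(trend_alerts([100, 125, 150, 176]))                  # prints ['daily']
print(trend_alerts([1000 + 10 * i for i in range(31)]))    # prints ['monthly']
```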

Automatic Migration & Scaling

When a host approaches memory or disk alert thresholds, the platform automatically migrates data to balance load. When a TiKV store reaches 800 GB, the system triggers an automatic expansion of the TiKV pool.

Large TiKV stores (> 800 GB) increase replica creation time and data migration duration; automatic scaling mitigates these issues.
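
A trigger for the 800 GB rule might be sketched as follows. Only the 800 GB trigger comes from the article; the 600 GB per-store target used to size the expansion, and the function itself, are hypothetical choices made to keep the example concrete.

```python
import math

def stores_to_add(store_sizes_gb, trigger_gb=800, target_gb=600):
    """Return how many TiKV stores to add, or 0 if no store hit the trigger.

    target_gb is an assumed post-expansion average size per store.
    """
    if not store_sizes_gb or max(store_sizes_gb) < trigger_gb:
        return 0
    needed = math.ceil(sum(store_sizes_gb) / target_gb)
    # Always add at least one store once the trigger has fired.
    return max(1, needed - len(store_sizes_gb))

print(stores_to_add([500, 520, 480]))  # prints 0
print(stores_to_add([810, 790, 800]))  # prints 1
```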

Conclusion

The case study demonstrates a transition from manual, error‑prone TiDB operations to a fully automated, work‑order‑driven platform. Core technical achievements include centralized metadata, machine‑level resource tables, standardized port naming, independent monitoring per cluster, multi‑level alarm escalation, and self‑adaptive migration and scaling. These practices improve resource utilization, reduce alarm noise, and increase overall system reliability. Implementations are provided as reference material; results may vary across environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
