How to Build a Scalable, Automated MySQL Operations Platform
This article explains how to standardize and automate MySQL management at scale, covering dedicated instance deployment, configuration consistency, multi‑instance creation, metadata collection, backup, monitoring, high‑availability with Zookeeper, and task orchestration using DBTask to achieve rapid, reliable database services.
Usage Patterns and Challenges
Two main usage patterns are common in large‑scale MySQL deployments:
Mixed (shared) deployment : a new database is created on an existing instance for speed, but applications interfere with each other, performance metrics (QPS, TPS, RT) are unavailable, root‑cause analysis is hard, and resource utilization is inefficient.
Dedicated instance per application : each application runs on its own MySQL instance, which aligns with virtualization and micro‑services, but requires a shift from single‑instance to multi‑instance management.
These patterns raise questions about fast service provisioning, resource allocation, monitoring, and high‑availability for many instances.
Standardization
One dedicated MySQL instance per application.
Master serves traffic; replicas are used only for disaster recovery.
MySQL compiled with thread_pool support.
Uniform MySQL version across all nodes.
Identical my.cnf files except for instance‑specific parameters such as server_id and buffer_pool_size.
Consistent data‑directory layout.
DBManage Automation Platform
The DBManage system automates instance creation, deployment, monitoring, backup, HA switching, migration and scaling with minimal human intervention.
1. Multi‑Instance Creation
Multiple MySQL processes run on a single host, each bound to a different port and using its own my.cnf while sharing the same binary.
Standardized my.cnf (only server_id, buffer_pool_size differ).
Standardized data directory.
Creation steps:
Initialize a reference database.
Copy the reference to a new instance with rsync while throttling I/O.
Adjust the instance‑specific my.cnf. Direct mysql_install_db is avoided because it is I/O‑heavy.
2. Metadata Collection & Monitoring
A custom db_agent runs on each MySQL host, discovers instances automatically and reports metadata and performance metrics.
Instance information: databases, replication topology, table metadata.
Per‑second heartbeat (more reliable than SHOW SLAVE STATUS).
Performance counters: QPS, TPS, response time (RT) collected via tcprstat.
Real‑time slow‑SQL detection.
Host‑level resource usage (CPU, disk, network) replaces external tools such as Zabbix.
3. Backup
Unified backup script works for both masters and replicas.
Only replicas are backed up after role verification.
Concurrent backup throttling across instances.
Backup destination can be HDFS or a regular file server (HDFS recommended).
4. Local Execution Agent
Remote operations (instance creation, restoration, etc.) are triggered by sending a message that invokes the corresponding script on the target host via the local agent.
5. Monitoring & Alerting
Collected metrics are stored in Graphite; dashboards and alerts are built on top of this time‑series data.
6. MySQL High‑Availability
Traditional MHA‑based HA is replaced by a Zookeeper‑driven solution that eliminates virtual IPs and uses VDDS for service discovery.
VDDS creates a new application configuration. ha_agent registers a temporary Zookeeper node and continuously updates instance status. ha_console builds switch metadata (source role, master instance, replication status, delay, last change time) from Zookeeper nodes.
The console watches the alive directory; disappearance of a node triggers a failover after verifying delay and replication health.
Switch actions are logged, e.g.
master_log_file: mysql-bin.000007, exec_master_log_pos: 57830.
{
"source_db_role": "slave",
"master_instance": "192.168.1.12_3306",
"repl_status": "ok",
"h_time_delay": 0,
"repl_delay": 0,
"last_change_time": "2016-10-15-01:00:45"
}Supported failure scenarios include instance crash, host crash, and network partition, each resulting in an automatic and safe failover.
7. DBTask – Orchestrating Database Operations
DBTask replaces manual procedures with scripted tasks. Each script performs a single atomic action; tasks are chained to implement complex workflows such as migration or resource provisioning.
Example migration workflow:
Apply for resources.
Create instance.
Restore replica A.
Restore replica B.
Configure data source in VDDS.
Pre‑switch checks.
Execute switch.
Remove VDDS configuration.
Shutdown old instance.
Example resource‑request workflow:
Apply for resources.
Create instance.
Create database and MySQL user.
Configure data source in VDDS.
Results & Outlook
The automation stack is built with Python, Shell and Go (real‑time slow‑SQL parsing) on the backend and Laravel + AngularJS on the frontend. It currently manages over 100 instances per machine, fulfills a new database request in under one minute, and supports automated sharding and scaling. Future work aims to evolve the platform into a private‑cloud service for broader business needs.
References
Python socket communication: https://github.com/chris-piekarski/python-json-socket
Python HDFS client: https://pypi.python.org/pypi/hdfs/
Response‑time tool (rt): https://github.com/Lowercases/tcprstat
Python Zookeeper client (Kazoo): https://github.com/python-zk/kazoo
TiDB SQL parser (used for table merging): https://github.com/pingcap/tidb
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
