
Applying Orchestrator for High‑Availability MySQL in TAL Education Group’s Database System

This article describes how TAL Education Group evaluated, selected, and customized the open‑source Orchestrator tool to build a highly available, secure, and extensible MySQL HA solution that meets 99.99% uptime, data‑integrity, cross‑datacenter, and operational automation requirements.

TAL Education Technology

MySQL is the most widely used relational database in the group, with thousands of instances supporting core services such as tutoring and online schools. The legacy HA solution, however, could not meet the group's strict requirements: 99.99% availability, failover without data loss, and extensibility.
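To make the 99.99% target concrete, the following minimal sketch converts an availability figure into a downtime budget. The function name `downtime_budget_minutes` is illustrative and not part of the original system; the arithmetic assumes a 365-day year.

```python
def downtime_budget_minutes(availability: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, allowed by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # 99.99% over a 365-day year leaves roughly 52.56 minutes of downtime.
    print(f"{downtime_budget_minutes(0.9999):.2f} minutes per year")
    # The same target over a 30-day window leaves roughly 4.32 minutes.
    print(f"{downtime_budget_minutes(0.9999, period_days=30):.2f} minutes per 30 days")
```

A budget of under an hour per year leaves little room for manual intervention, which is why detection and failover need to be automated.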

To address these gaps, the team defined three key capabilities for a new HA system: rapid loss‑mitigation, accurate and consistent failover with full data integrity, and cross‑datacenter, easily extensible migration support.

After comparing mainstream MySQL HA architectures, the team chose the open‑source Orchestrator (written in Go) for its web UI, API, multi‑node Raft clustering, and rich command set, and then customized it to integrate with internal platforms.

Key Orchestrator features include graphical topology editing, hook interfaces for custom operations, Raft‑based clustering for Orchestrator's own high availability, multiple recovery modes, comprehensive primary‑selection logic, and automatic master‑failure detection that cross‑checks the master's state through its replicas.

The fault‑detection workflow creates three long‑lived connections per node (to the master and all slaves) and polls the topology every 5 seconds. A master failure is confirmed only when the master is unreachable and all of its replicas also report replication errors; only then is a failover triggered.
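The double‑check rule described above can be sketched as a small decision function. This is a simplified illustration, not Orchestrator's actual implementation: the `ReplicaStatus` type and `master_failure_confirmed` function are hypothetical names, and treating a master with no replicas as "not confirmed" is an assumption made here.

```python
from dataclasses import dataclass
from typing import Sequence

POLL_INTERVAL_SECONDS = 5  # polling period stated in the article


@dataclass(frozen=True)
class ReplicaStatus:
    """What one poll learned about a replica (hypothetical structure)."""
    host: str
    replication_broken: bool  # True if the replica reports a replication error


def master_failure_confirmed(master_reachable: bool,
                             replicas: Sequence[ReplicaStatus]) -> bool:
    """Confirm a master failure only if every replica agrees it is gone."""
    if master_reachable:
        return False  # the monitor can still reach the master
    if not replicas:
        # Assumption: with no replicas there is no second opinion,
        # so the failure is not confirmed automatically.
        return False
    return all(r.replication_broken for r in replicas)


if __name__ == "__main__":
    replicas = [ReplicaStatus("replica-1", True), ReplicaStatus("replica-2", False)]
    # One replica still replicates, so this looks like a monitor-side network issue.
    print(master_failure_confirmed(False, replicas))
```

Requiring agreement from the replicas guards against a false failover when only the link between the monitor and the master is broken.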

Promotion logic prioritises GTID/file‑position, log_slave_updates, newer MySQL versions, row‑based binlog format, same‑room preference, absence of errant GTIDs, and finally a configurable rule hierarchy (must → prefer → neutral → prefer_not → must_not).
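As a rough illustration of this priority order, the sketch below ranks candidates with a tuple sort key that follows the criteria in the order listed. It is a simplification under stated assumptions, not Orchestrator's real algorithm: the `Candidate` fields are hypothetical, `executed_position` is a single integer standing in for a GTID set or binlog file position, and candidates marked `must_not` are excluded outright.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Lower rank is better; order taken from the article's rule hierarchy.
RULE_RANK = {"must": 0, "prefer": 1, "neutral": 2, "prefer_not": 3, "must_not": 4}


@dataclass(frozen=True)
class Candidate:
    host: str
    executed_position: int        # hypothetical stand-in for GTID/file position
    log_slave_updates: bool
    version: Tuple[int, int, int]
    binlog_format: str            # e.g. "ROW", "MIXED", "STATEMENT"
    data_center: str
    has_errant_gtid: bool
    promotion_rule: str = "neutral"


def sort_key(c: Candidate, failed_master_dc: str) -> tuple:
    """Smaller tuples sort first, so each element is 'badness' for one criterion."""
    return (
        -c.executed_position,                  # most advanced position first
        not c.log_slave_updates,               # log_slave_updates enabled first
        tuple(-part for part in c.version),    # newer version first
        c.binlog_format != "ROW",              # row-based binlog first
        c.data_center != failed_master_dc,     # same room as the failed master first
        c.has_errant_gtid,                     # no errant GTIDs first
        RULE_RANK[c.promotion_rule],           # configured rule as the final tie-break
    )


def choose_successor(candidates: Sequence[Candidate],
                     failed_master_dc: str) -> Optional[Candidate]:
    """Return the best-ranked eligible candidate, or None if none is eligible."""
    eligible = [c for c in candidates if c.promotion_rule != "must_not"]
    if not eligible:
        return None
    return min(eligible, key=lambda c: sort_key(c, failed_master_dc))
```

Encoding the criteria as one tuple keeps the precedence explicit: a later criterion only matters when all earlier ones tie.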

Enhancements were added to detect slave failures, link Orchestrator actions with the middleware layer via hooks, and connect the HA component with the database management platform to automate topology updates, scaling, migrations, and maintenance.
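A hook that links a failover to the middleware layer might look like the sketch below. Orchestrator's configuration supports hook lists such as `PostMasterFailoverProcesses`, whose commands can receive placeholders like `{failedHost}`, `{failedPort}`, `{successorHost}`, and `{successorPort}`. Everything else here is an assumption for illustration: the script, its command-line flags, and the payload schema are hypothetical, and the real middleware interface used at TAL is not described in the article.

```python
import argparse
import json
from typing import List, Optional


def build_switch_payload(failed_host: str, failed_port: int,
                         successor_host: str, successor_port: int) -> dict:
    """Build the message a (hypothetical) proxy layer would need to repoint writes."""
    return {
        "action": "switch_master",
        "old_master": f"{failed_host}:{failed_port}",
        "new_master": f"{successor_host}:{successor_port}",
    }


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Post-failover hook sketch")
    parser.add_argument("--failed-host", required=True)
    parser.add_argument("--failed-port", type=int, default=3306)
    parser.add_argument("--successor-host", required=True)
    parser.add_argument("--successor-port", type=int, default=3306)
    args = parser.parse_args(argv)

    payload = build_switch_payload(args.failed_host, args.failed_port,
                                   args.successor_host, args.successor_port)
    # A real hook would send this to the middleware; the sketch only prints it.
    print(json.dumps(payload, sort_keys=True))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

Assuming the script were saved as `notify_proxy.py` (a hypothetical name), a hook entry could invoke it as `notify_proxy.py --failed-host {failedHost} --failed-port {failedPort} --successor-host {successorHost} --successor-port {successorPort}`.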

Before production, extensive failure‑scenario tests verified upstream‑downstream linkage stability, alerting behavior, and 100% successful failover across all identified fault cases.

Future work includes deploying Orchestrator nodes across multiple data centers, using Raft majority voting for cross‑site primary election and enabling automatic proxy isolation, and extending Orchestrator to manage Redis, DBProxy, and other components so that they gain self‑healing capabilities.
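The reason node placement matters for Raft majority voting can be shown with a short calculation. This is a generic quorum sketch, not code from Orchestrator; the function names and the data center labels are illustrative.

```python
from typing import Dict


def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def survives_dc_loss(nodes_per_dc: Dict[str, int], lost_dc: str) -> bool:
    """Return True if the cluster still has a majority after losing one data center."""
    total = sum(nodes_per_dc.values())
    remaining = total - nodes_per_dc.get(lost_dc, 0)
    return remaining >= quorum_size(total)


if __name__ == "__main__":
    # Three nodes spread over three sites tolerate the loss of any single site.
    print(survives_dc_loss({"dc1": 1, "dc2": 1, "dc3": 1}, "dc1"))
    # Two of three nodes in one site: losing that site loses the majority.
    print(survives_dc_loss({"dc1": 2, "dc2": 1}, "dc1"))
```

Under this model, a deployment spanning only two data centers cannot survive the loss of whichever site holds the majority of nodes, which is one argument for spreading nodes over at least three sites.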

Since March 2022, the solution has served three major business lines with over a thousand instances, achieving a 100% failover success rate, reducing annual costs by ¥600k, and providing a robust multi‑datacenter HA foundation for the group’s cloud database management system.

Written by

TAL Education Technology

TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.
