How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System
This article recounts the development of ZzoOnlineDDL, a MySQL schema‑change platform, detailing its 1.0 limitations, the 2.0 architectural upgrades, feature set—including intelligent tool selection, timed execution, sharding support, monitoring, and retry mechanisms—and lessons learned from real‑world incidents such as MDL locks, disk pressure, and unique‑index pitfalls.
Background
The article describes the evolution of an internal MySQL schema‑change platform from a fragile 1.0 version to a robust 2.0 version that supports interactive control, sharding automation, timed execution, and comprehensive monitoring.
1.0 Era – Limitations
Business pain points : no support for adding unique indexes, column rename, interactive controls, or sharding projects; opaque error messages.
DBA pain points : unclear error prompts, manual handling of many requests, low controllability after task start, and missed opportunities to use MySQL 5.6+ online DDL.
Other issues : binlog explosion, MDL lock incidents, deadlocks, latency spikes.
2.0 Era – New Architecture
Launched on 2023‑02‑17, the platform adds:
User interaction: pause, resume, terminate tasks.
Support for adding unique indexes.
Automatic generation of work orders for horizontal sharding projects.
Rich monitoring of MDL locks, disk usage, load, and replication lag.
Timed execution that pauses tasks outside user‑specified windows.
Fast‑alter mode using ALTER TABLE for metadata‑only changes (e.g., VARCHAR expansion, column drop, index removal, column rename, default/value changes, MySQL 8.0 fast add column).
Failure rate dropped from 9.65 % (1.0, 2851 orders) to 4.25 % (2.0, 2424 orders), a 66 % reduction. Fast‑alter jobs accounted for 267 orders (11 % of total), delivering significant efficiency gains on large tables.
Feature Details
Intelligent Alter
The platform parses the ALTER statement and selects the optimal tool:
Metadata‑only changes → fast‑alter (no data copy).
Otherwise prefers gh‑ost; falls back to pt‑osc only when necessary.
Rationale: fast‑alter has negligible time cost and load; gh‑ost avoids triggers and deadlocks; pt‑osc is avoided because its triggers can cause deadlocks and data‑loss risk when adding unique indexes.
Sharding Support
For horizontal partitioning projects, the platform automatically creates work orders for each physical table, eliminating manual submission of dozens of orders.
Timed Execution
Users define a time window; if a task cannot finish within the window it is paused and resumed later, preventing overload during peak periods.
Unique‑Index Handling
A hidden UI element appears only when the SQL adds a unique index. By default submission is blocked; the user must acknowledge a risk document. When using pt‑osc, the platform disables unique‑index addition via pt‑osc --check-unique-key-change to avoid data loss.
User Interaction & Monitoring
Progress and ETA are displayed on the order page and pushed to WeChat/email. Notification frequency adapts to remaining time. Only the requester or DBA can pause, terminate, or resume tasks.
Monitoring includes:
MDL locks – kills empty connections, then long‑sleep connections, then pauses or fails after 30 min.
Disk space – requires at least twice the target table size; fails if insufficient.
Replication lag – checks before start; waits up to 600 s before proceeding or aborts.
Load – checks active connections; waits 30 s, then aborts if still high.
Tool‑specific flags used: gh‑ost --max‑lag‑millis,
gh‑ost --max‑load pt‑osc --max‑lag, pt‑osc --max‑load,
pt‑osc --critical‑loadWeak Atomicity & Retry
For multi‑statement orders, the platform pre‑runs statements in a test environment. If a failure occurs, a retry mechanism skips already‑successful statements and resumes from the point of error within the same order.
Cleanup
After task completion, temporary tables and triggers created by gh‑ost or pt‑osc are removed. For pt‑osc, trigger cleanup occurs before dropping temporary tables to avoid DML failures.
Problem Analysis
SQL Parsing
Accurately determining fast‑alter eligibility is complex. The parser strips comments (/* */, #, --, COMMENT ''), ignores parenthesized sections, and applies heuristics (e.g., disallowing unique‑index addition with other operations, restricting CHANGE to rename only). Mis‑classification can lead to unintended full table copies or replication lag.
VARCHAR Expansion
Changing VARCHAR length requires the same byte‑length for the length prefix. If the column is indexed, the platform may mistakenly treat it as fast‑alter, causing a full rebuild and latency spikes. Some non‑indexed expansions also trigger full copies due to internal statistics or constraints, which are hard to reproduce.
Unique Indexes
Unique indexes provide no read‑performance advantage over regular indexes but increase write cost (no change‑buffer). The platform recommends normal indexes and, when a unique index is required, advises using gh‑ost with hook support for large tables, avoiding pt‑osc due to its trigger‑based implementation.
Conclusion
The 2.0 platform transforms a fragile script into a production‑grade service with interactive controls, automated sharding, timed execution, extensive monitoring, and safety nets such as retries and weak atomicity safeguards. Real‑world incidents (MDL locks, disk pressure, latency, deadlocks) guided the design improvements, resulting in a system that behaves like an intelligent robot assisting DBAs with schema changes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
