Databases 34 min read

How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System

This article recounts the development of ZzoOnlineDDL, a MySQL schema‑change platform, detailing its 1.0 limitations, the 2.0 architectural upgrades, feature set—including intelligent tool selection, timed execution, sharding support, monitoring, and retry mechanisms—and lessons learned from real‑world incidents such as MDL locks, disk pressure, and unique‑index pitfalls.

dbaplus Community
dbaplus Community
dbaplus Community
How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System

Background

The article describes the evolution of an internal MySQL schema‑change platform from a fragile 1.0 version to a robust 2.0 version that supports interactive control, sharding automation, timed execution, and comprehensive monitoring.

1.0 Era – Limitations

Business pain points : no support for adding unique indexes, column rename, interactive controls, or sharding projects; opaque error messages.

DBA pain points : unclear error prompts, manual handling of many requests, low controllability after task start, and missed opportunities to use MySQL 5.6+ online DDL.

Other issues : binlog explosion, MDL lock incidents, deadlocks, latency spikes.

2.0 Era – New Architecture

Launched on 2023‑02‑17, the platform adds:

User interaction: pause, resume, terminate tasks.

Support for adding unique indexes.

Automatic generation of work orders for horizontal sharding projects.

Rich monitoring of MDL locks, disk usage, load, and replication lag.

Timed execution that pauses tasks outside user‑specified windows.

Fast‑alter mode using ALTER TABLE for metadata‑only changes (e.g., VARCHAR expansion, column drop, index removal, column rename, default/value changes, MySQL 8.0 fast add column).

Failure rate dropped from 9.65 % (1.0, 2851 orders) to 4.25 % (2.0, 2424 orders), a 66 % reduction. Fast‑alter jobs accounted for 267 orders (11 % of total), delivering significant efficiency gains on large tables.

Feature Details

Intelligent Alter

The platform parses the ALTER statement and selects the optimal tool:

Metadata‑only changes → fast‑alter (no data copy).

Otherwise prefers gh‑ost; falls back to pt‑osc only when necessary.

Rationale: fast‑alter has negligible time cost and load; gh‑ost avoids triggers and deadlocks; pt‑osc is avoided because its triggers can cause deadlocks and data‑loss risk when adding unique indexes.

Sharding Support

For horizontal partitioning projects, the platform automatically creates work orders for each physical table, eliminating manual submission of dozens of orders.

Timed Execution

Users define a time window; if a task cannot finish within the window it is paused and resumed later, preventing overload during peak periods.

Unique‑Index Handling

A hidden UI element appears only when the SQL adds a unique index. By default submission is blocked; the user must acknowledge a risk document. When using pt‑osc, the platform disables unique‑index addition via pt‑osc --check-unique-key-change to avoid data loss.

User Interaction & Monitoring

Progress and ETA are displayed on the order page and pushed to WeChat/email. Notification frequency adapts to remaining time. Only the requester or DBA can pause, terminate, or resume tasks.

Monitoring includes:

MDL locks – kills empty connections, then long‑sleep connections, then pauses or fails after 30 min.

Disk space – requires at least twice the target table size; fails if insufficient.

Replication lag – checks before start; waits up to 600 s before proceeding or aborts.

Load – checks active connections; waits 30 s, then aborts if still high.

Tool‑specific flags used: gh‑ost --max‑lag‑millis,

gh‑ost --max‑load
pt‑osc --max‑lag

, pt‑osc --max‑load,

pt‑osc --critical‑load

Weak Atomicity & Retry

For multi‑statement orders, the platform pre‑runs statements in a test environment. If a failure occurs, a retry mechanism skips already‑successful statements and resumes from the point of error within the same order.

Cleanup

After task completion, temporary tables and triggers created by gh‑ost or pt‑osc are removed. For pt‑osc, trigger cleanup occurs before dropping temporary tables to avoid DML failures.

Problem Analysis

SQL Parsing

Accurately determining fast‑alter eligibility is complex. The parser strips comments (/* */, #, --, COMMENT ''), ignores parenthesized sections, and applies heuristics (e.g., disallowing unique‑index addition with other operations, restricting CHANGE to rename only). Mis‑classification can lead to unintended full table copies or replication lag.

VARCHAR Expansion

Changing VARCHAR length requires the same byte‑length for the length prefix. If the column is indexed, the platform may mistakenly treat it as fast‑alter, causing a full rebuild and latency spikes. Some non‑indexed expansions also trigger full copies due to internal statistics or constraints, which are hard to reproduce.

Unique Indexes

Unique indexes provide no read‑performance advantage over regular indexes but increase write cost (no change‑buffer). The platform recommends normal indexes and, when a unique index is required, advises using gh‑ost with hook support for large tables, avoiding pt‑osc due to its trigger‑based implementation.

Conclusion

The 2.0 platform transforms a fragile script into a production‑grade service with interactive controls, automated sharding, timed execution, extensive monitoring, and safety nets such as retries and weak atomicity safeguards. Real‑world incidents (MDL locks, disk pressure, latency, deadlocks) guided the design improvements, resulting in a system that behaves like an intelligent robot assisting DBAs with schema changes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringmysqlOnline DDLgh-ostSchema Changept-osc
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.