How Ctrip’s DRC Middleware Enables Real‑Time Multi‑Active MySQL Replication
DRC (Data Replicate Center) is Ctrip’s database middleware that provides real‑time bidirectional MySQL replication, achieving low‑latency, multi‑active data access across data centers while ensuring consistency through GTID, writeset, conflict resolution, DDL handling, and comprehensive monitoring.
Background
Ctrip operates MySQL clusters across two data centers. Data center A hosts a primary‑replica pair, while data center B hosts a replica used for disaster‑recovery (DR). In the original configuration, applications in B had to write to A, and DBA staff performed manual DR failover when A failed.
To enable true multi‑active, geographically distributed reads and writes without manual DR, Ctrip built a real‑time bidirectional (and multi‑directional) replication component.
DRC Overview
DRC (Data Replicate Center) is a database middleware that provides bidirectional or multi‑directional replication. It supports Ctrip’s G2 (global, high‑quality service) strategy and enables globally distributed deployments.
Architecture
DRC follows a centralized server design and works together with the DAL (Data Access Layer) middleware, which supplies local read‑write capability. The main components are:
Replicator Container: Manages Replicator instances. Each instance pretends to be a MySQL replica, pulls binlogs from a source cluster, and stores them locally.
Applier Container: Manages Applier instances. An Applier connects to a Replicator, reads the stored binlog events, parses them into SQL statements, and applies them to the target MySQL.
Cluster Manager: Handles high‑availability switching, including restarts caused by primary‑replica switches and role changes of Replicator/Applier.
Console: Exposes UI operations, external APIs, and monitoring/alerting interfaces.
DB Access Requirements
To keep replication latency low and data consistency high, every participating MySQL instance must satisfy:
MySQL version 5.7.22 or newer.
Writeset parallel replication enabled on the master (available from 5.7.22).
GTID enabled.
Each table contains a millisecond‑precision timestamp column.
Each table has a primary key or a unique key.
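The instance-level requirements above can be checked mechanically. The following is a minimal sketch, assuming the server version string and the output of SHOW GLOBAL VARIABLES have already been fetched into a dictionary. The variable names are the MySQL 5.7.22 system variables for GTID and writeset tracking. The checks on binlog_format and enforce_gtid_consistency are assumptions added here because writeset extraction and GTID mode depend on them; the source does not list them. The function names are hypothetical.

```python
# Expected values for the replication-related system variables.
# binlog_format and enforce_gtid_consistency are assumptions (not listed in the article).
REQUIRED_SETTINGS = {
    "gtid_mode": "ON",
    "enforce_gtid_consistency": "ON",
    "binlog_format": "ROW",
    "transaction_write_set_extraction": "XXHASH64",
    "binlog_transaction_dependency_tracking": "WRITESET",
}
MIN_VERSION = (5, 7, 22)


def parse_version(version: str) -> tuple:
    """Turn a string such as '5.7.22-log' into (5, 7, 22)."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def check_instance(version: str, variables: dict) -> list:
    """Return a list of human-readable problems; an empty list means the instance qualifies."""
    problems = []
    if parse_version(version) < MIN_VERSION:
        problems.append(f"version {version} is older than 5.7.22")
    for name, expected in REQUIRED_SETTINGS.items():
        actual = str(variables.get(name, "")).upper()
        if actual != expected:
            problems.append(f"{name} is {actual or 'unset'}, expected {expected}")
    return problems
```

The table-level requirements (a primary or unique key and a millisecond-precision timestamp column) would need a separate check against the schema and are not covered by this sketch.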
GTID Primer
GTID (Global Transaction ID) was introduced in MySQL 5.6.5. It replaces file‑position based replication with a format source_id:transaction_id, where source_id is the server UUID and transaction_id is a sequential number assigned at commit. GTID allows precise binlog positioning after failover and is the basis for DRC’s ordering guarantees.
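To make the format concrete, here is a small sketch that parses MySQL's textual GTID set notation (source_id followed by one or more colon-separated transaction ranges, with sets for several sources separated by commas) and tests whether a single GTID is contained in it. The function names are illustrative and are not part of DRC or MySQL.

```python
def parse_gtid_set(text: str) -> dict:
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: [(1, 5), (8, 8)], uuid2: [(3, 3)]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for r in ranges:
            start, _, end = r.partition("-")
            # A single number such as '8' is the interval (8, 8).
            intervals.append((int(start), int(end or start)))
    for intervals in result.values():
        intervals.sort()
    return result


def contains(gtid_set: dict, gtid: str) -> bool:
    """Return True if a single GTID 'uuid:n' falls inside the parsed set."""
    uuid, _, txn = gtid.partition(":")
    number = int(txn)
    return any(lo <= number <= hi for lo, hi in gtid_set.get(uuid.lower(), []))
```

For example, the set "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5:8" contains transactions 1 through 5 and 8 from that server, but not 6 or 7.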
Binlog Replication Pipeline
A unidirectional replication chain consists of:
Replicator: Pulls binlog events from the source MySQL, writes them to local disk, and makes them available to the Applier.
Applier: Requests stored binlog events, parses them into SQL, and applies them in parallel to the target MySQL.
The pipeline involves network I/O, disk reads/writes, and CPU processing.
Latency Optimizations
Latency is reduced at three layers:
Network Layer
Replicator uses GTID‑based replication and the open‑source XPipe component (https://github.com/ctripcorp/x-pipe) for asynchronous network communication.
System Layer
Binlog events are parsed and kept in off‑heap memory. Heartbeat events and events from irrelevant databases/tables are filtered out. Persisted events are written to the OS page cache and flushed periodically, minimizing disk I/O.
Application Layer
Applier adopts MySQL’s Writeset parallel replication algorithm with a water‑level based parallelism scheme, allowing many SQL statements to be applied concurrently.
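The idea behind writeset-based parallelism can be sketched as follows. Each transaction carries the set of row keys it modifies; a transaction depends on the most recent earlier transaction that touched any of the same keys, and it may start once everything up to that point has committed. This is an illustrative model only, not MySQL's or DRC's actual code: keys are plain strings rather than hashes, sequence numbers are assumed to start at 1 and be contiguous, and the "water level" is interpreted here as a low-water mark of committed transactions, which is an assumption.

```python
def assign_dependencies(transactions):
    """transactions: list of (sequence_number, set_of_row_keys) in binlog order.

    Returns {sequence_number: last_committed}, where last_committed is the
    sequence number of the latest earlier transaction that wrote one of the
    same keys (0 if there is none).
    """
    last_writer = {}
    deps = {}
    for seq, keys in transactions:
        deps[seq] = max((last_writer.get(k, 0) for k in keys), default=0)
        for k in keys:
            last_writer[k] = seq
    return deps


def runnable(deps, committed):
    """Return the sequence numbers that may start now, given the committed set."""
    # Low-water mark: the largest n such that transactions 1..n have all committed.
    low_water_mark = 0
    while low_water_mark + 1 in committed:
        low_water_mark += 1
    return sorted(
        seq for seq, dep in deps.items()
        if seq not in committed and dep <= low_water_mark
    )
```

With four transactions where the first and third update the same row, the first, second, and fourth can start immediately, while the third must wait until the first has committed.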
Idle Detection & Flow Control
Both Replicator and Applier send a heartbeat every 10 s; a 30 s timeout triggers reconnection. Replicator also uses Netty’s WRITE_BUFFER_WATER_MARK to throttle sending when the Applier cannot keep up.
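Both mechanisms can be modeled in a few lines. The sketch below is a language-neutral illustration in Python, not DRC's Netty code: IdleDetector and WatermarkBuffer are hypothetical names, timestamps are passed in explicitly so the logic is easy to test, and the watermark behaviour follows the usual high/low threshold pattern (stop writing above the high mark, resume below the low mark).

```python
HEARTBEAT_INTERVAL_S = 10  # each side sends a heartbeat this often
IDLE_TIMEOUT_S = 30        # no traffic for this long triggers a reconnect


class IdleDetector:
    """Tracks the last time anything was received from the peer."""

    def __init__(self, now: float):
        self.last_seen = now

    def on_message(self, now: float) -> None:
        self.last_seen = now

    def should_reconnect(self, now: float) -> bool:
        return now - self.last_seen >= IDLE_TIMEOUT_S


class WatermarkBuffer:
    """Counts bytes queued for the peer and toggles a writable flag."""

    def __init__(self, low: int, high: int):
        self.low, self.high = low, high
        self.pending = 0
        self.writable = True

    def enqueue(self, nbytes: int) -> None:
        self.pending += nbytes
        if self.pending > self.high:
            self.writable = False  # sender should pause

    def flushed(self, nbytes: int) -> None:
        self.pending = max(0, self.pending - nbytes)
        if self.pending < self.low:
            self.writable = True   # sender may resume
```

The gap between the two thresholds avoids rapid toggling: once the sender is paused, it stays paused until the backlog has drained well below the point that triggered the pause.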
Data Consistency Guarantees
DRC ensures three properties:
Ordering: Binlog files are stored using MySQL’s native format. Replicator processes events sequentially, preserving the original order even for custom DDL events.
At‑Least‑Once Delivery: No event is lost. An event may be delivered more than once, and idempotent execution makes duplicates harmless.
Conflict Resolution: Provides eventual consistency when concurrent updates occur.
Ordering Details
During pull, Replicator reads binlog files in native order and forwards events to Applier in the same sequence. Custom snapshot and DDL events are also ordered.
At‑Least‑Once Mechanisms
Restart Recovery: On restart, Replicator locates the last binlog file, parses the previous_gtids_event, merges GTID sets, and truncates incomplete transactions. Applier receives the current GTID set from the target DB (via Cluster Manager) and requests only missing events.
Loop‑Replication Avoidance: DRC tags transactions that originate from DRC with a special marker. The opposite Replicator filters out marked transactions. Additionally, GTID‑based source‑UUID filtering prevents cycles.
Idempotence: MySQL records executed GTIDs. If Applier receives a transaction whose GTID is already applied, MySQL silently skips it, ensuring safe duplicate delivery.
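The loop filter and the idempotent apply can be illustrated with a toy model. In this sketch, events are plain dictionaries, the from_drc flag stands in for DRC's marker, and the Target class imitates how MySQL skips a transaction whose GTID is already in its executed set. All names and the event layout are hypothetical.

```python
def replicator_filter(events, excluded_uuids):
    """Drop transactions that came from the peer: either marked by DRC
    or carrying a GTID whose source UUID belongs to the peer cluster."""
    kept = []
    for event in events:
        source_uuid = event["gtid"].split(":")[0]
        if source_uuid in excluded_uuids or event.get("from_drc", False):
            continue
        kept.append(event)
    return kept


class Target:
    """Toy stand-in for the target MySQL: remembers executed GTIDs."""

    def __init__(self):
        self.executed = set()
        self.rows = {}

    def apply(self, event) -> bool:
        """Apply the event once; return False if its GTID was already executed."""
        if event["gtid"] in self.executed:
            return False
        self.rows.update(event["changes"])
        self.executed.add(event["gtid"])
        return True
```

Delivering the same event twice leaves the target unchanged the second time, which is what makes at-least-once delivery safe.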
Conflict Resolution
Conflicts are minimized by routing a user’s traffic to the same data center (local‑to‑local routing in DAL) and by allocating distinct auto‑increment ID ranges or using a global ID generator. When a conflict does occur, DRC compares the millisecond‑precision timestamp columns and keeps the later update. Conflicting statements are logged and can be presented for manual review.
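The timestamp comparison amounts to a last-write-wins rule. A minimal sketch, assuming rows are dictionaries and the timestamp column is called updated_at (a hypothetical name); keeping the local row on an exact tie is also an assumption, since the source does not say how ties are handled.

```python
from datetime import datetime


def resolve_conflict(local_row: dict, incoming_row: dict, ts_column: str = "updated_at") -> dict:
    """Return the row to keep: the one with the later timestamp.

    On an exact tie the local row is kept (an assumption of this sketch).
    """
    if incoming_row[ts_column] > local_row[ts_column]:
        return incoming_row
    return local_row


local = {"id": 1, "status": "paid", "updated_at": datetime(2024, 1, 1, 12, 0, 0, 120000)}
incoming = {"id": 1, "status": "refunded", "updated_at": datetime(2024, 1, 1, 12, 0, 0, 450000)}
winner = resolve_conflict(local, incoming)
```

Here the two updates are 330 ms apart, so second-precision timestamps could not order them; this is why the access requirements call for millisecond precision.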
DDL Support
DDL changes require the Applier to know the exact table schema at the moment each binlog event was generated. DRC stores table‑structure snapshots and DDL events inside custom binlog events, eliminating the need for an external metadata store.
When a DDL event arrives, an embedded lightweight database reconstructs the required schema version for subsequent events.
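The lookup "which schema was in effect when this event was written" can be sketched with a versioned history. This is a simplified in-memory stand-in for the embedded database the article describes: DDL events are assumed to arrive already resolved into a full column list, positions are assumed to be increasing integers, and the class and method names are hypothetical.

```python
import bisect


class SchemaHistory:
    """Keeps one schema version per DDL event, ordered by binlog position."""

    def __init__(self, snapshot: dict):
        # snapshot: {table_name: [column, ...]} taken at position 0.
        self._positions = [0]
        self._versions = [{t: list(cols) for t, cols in snapshot.items()}]

    def on_ddl(self, position: int, table: str, columns: list) -> None:
        """Record the table's new column list as of the given position."""
        latest = {t: list(cols) for t, cols in self._versions[-1].items()}
        latest[table] = list(columns)
        self._positions.append(position)
        self._versions.append(latest)

    def columns_at(self, position: int, table: str) -> list:
        """Return the column list that applies to an event at this position."""
        index = bisect.bisect_right(self._positions, position) - 1
        return self._versions[index][table]
```

An event written before a DDL is decoded with the old column list, and an event at or after the DDL position is decoded with the new one.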
For online schema changes, Ctrip uses gh‑ost, which creates a shadow table (_xxx_gho), syncs data, and swaps tables during low‑traffic windows. DRC tracks these shadow‑table DDL events and updates its schema cache accordingly. Direct DDL on the source is also captured via binlog events and handled in the same way.
Monitoring & Alerts
DRC exposes core metrics and alerts:
Replication latency (typically < 1 s in production).
Data‑consistency checks (ordering, at‑least‑once, conflict detection).
Traffic and TPS monitoring.
Business‑unit, application, and IDC‑level alerts.
DDL change monitoring.
Table‑structure consistency alerts.
GTID set gap monitoring.
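As an illustration of the last item, a gap in a GTID set is a run of transaction numbers missing between two executed intervals of the same source. The sketch below assumes the intervals for one source UUID are already available as inclusive, non-overlapping (start, end) pairs; find_gaps is a hypothetical helper, not a DRC API.

```python
def find_gaps(intervals):
    """Return the missing (start, end) ranges between consecutive intervals."""
    gaps = []
    ordered = sorted(intervals)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps
```

For intervals (1, 5) and (8, 10), transactions 6 and 7 are reported as a gap. A persistent gap suggests that some transactions were never applied on the target and is worth an alert.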
Conclusion
DRC achieves low replication latency and high data consistency through network‑level asynchronous I/O, system‑level off‑heap memory and page‑cache usage, and application‑level parallel Writeset replication. GTID provides ordering, loop‑replication avoidance, and idempotence. Conflict handling relies on routing, ID range isolation, and timestamp‑based resolution, which yields eventual consistency when concurrent updates occur. DDL support is realized via embedded schema snapshots and tracking of gh‑ost shadow tables, allowing online schema changes without breaking replication. Future work focuses on high availability, overseas deployment support, and further infrastructure enhancements to back Ctrip’s global strategy.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
