Scaling and Migrating a High‑Volume Order System with Sharding, Data Synchronization and Gray‑Rollout on Alibaba Cloud
To support Gaode Taxi’s soaring order volume, the team expanded from four to eight ECS instances and re‑sharded 256 tables into 4,096. They built a custom data‑sync middleware on DTS's binlog‑to‑Kafka capability for full‑load and incremental migration, added rigorous validation and repair processes, and cut over through a gray rollout with ABC verification, completing the migration without intrusive changes to business code and without incidents.
Background – In 2020 the author was responsible for a high‑traffic order system of Gaode Taxi. The system originally ran on 4 ECS instances, each with a single RDS database containing 64 tables (256 tables total). Some tables had already exceeded ten million rows and were projected to reach hundreds of millions within a year, causing performance concerns.
Capacity Planning – To handle the growth the team expanded to 8 instances (16C/64G/3T SSD) and increased the number of tables per database to 128, resulting in 4,096 tables in total (8 × 128 alone gives 1,024, so this total implies several databases per instance, a layout not spelled out here). This gives enough capacity for ~5 × 10⁷ daily orders over three years. The plan also considered future instance upgrades (e.g., 16C/128G, 32C/256G) and the possibility of adding more instances without re‑hashing data.
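As a rough sanity check on the capacity figure, the arithmetic can be sketched as follows. The assumptions here are not from the source: one row per order, an even spread across all tables, and no archiving. The team's actual sizing model is not described.

```python
# Back-of-the-envelope sizing under the stated assumptions.
DAILY_ORDERS = 50_000_000   # ~5 x 10^7 orders per day, from the plan
YEARS = 3
TOTAL_TABLES = 4096

# Leap days are ignored; this is an order-of-magnitude estimate.
total_rows = DAILY_ORDERS * 365 * YEARS
rows_per_table = total_rows / TOTAL_TABLES

print(f"total rows after {YEARS} years: {total_rows:,}")
print(f"average rows per table: {rows_per_table:,.0f}")
```

Under these assumptions each table ends up with roughly 13 million rows after three years, close to the ten‑million‑row level at which the old tables started to raise performance concerns.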
Data Migration – Because the migration involved re‑hashing (256 → 4,096 tables), Alibaba Cloud DTS could not be used directly. A custom data‑sync middleware was built on top of DTS's binlog‑to‑Kafka capability. The migration workflow includes preparation, full‑load sync, incremental sync, re‑hash, data validation, and repair.
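To see why a table‑for‑table copy is not enough, the sketch below assumes plain modulo sharding on a numeric order ID. The source does not state the actual hash function, and the function names are illustrative. Under that assumption every old table fans out into 16 new tables, so each row has to be re‑routed individually.

```python
OLD_TABLES = 256
NEW_TABLES = 4096

def old_table_index(order_id: int) -> int:
    """Table index under the old 256-table layout (assumed modulo hash)."""
    return order_id % OLD_TABLES

def new_table_index(order_id: int) -> int:
    """Table index under the new 4,096-table layout (assumed modulo hash)."""
    return order_id % NEW_TABLES

def fan_out(old_index: int, sample_ids) -> set:
    """New-table indexes that receive rows from one old table."""
    return {
        new_table_index(i)
        for i in sample_ids
        if old_table_index(i) == old_index
    }

targets = fan_out(5, range(100_000))
print(len(targets))         # 16
print(sorted(targets)[:3])  # [5, 261, 517]
```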
Preparation – Identify a unique business ID for each table and make sure every table has a unique index on it. Where a table lacks one, add a unique (if necessary composite) index so that replayed or retried writes cannot create duplicate rows during sync.
Data Synchronization – The overall scheme uses binlog streaming to Kafka, then the data‑sync service consumes, filters, merges, re‑hashes, splits by target table, and performs batch inserts/updates. Key steps:
Full‑load: cursor‑based batch select from the old DB, re‑hash, then batch insert into the new DB (with the MySQL JDBC option rewriteBatchedStatements=true, so that batches are sent as multi‑row statements).
Incremental: start the one‑way binlog‑to‑Kafka stream before the full load so that ongoing changes are captured, then begin consuming it once the full load completes.
Real‑time bidirectional sync: uses a transaction table (tb_transaction) to “color” data written by the middleware, preventing looped binlog consumption.
The transaction handling code is shown below:
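# Open a transaction so the following SQL statements are atomic and consistent
start transaction;
set autocommit = 0;
# Set status = 1 in the transaction table: the business data that follows is "colored"
update tb_transaction set status = 1 where tablename = ${tableName};
# The middleware's business writes, which generate binlog events
insert xxx;
update xxx;
update xxx;
# Set status = 0 in the transaction table: the coloring ends here
update tb_transaction set status = 0 where tablename = ${tableName};
commit;
During sync, the middleware discards the binlog events that fall between the status=1 and status=0 updates of tb_transaction. Those events were produced by the middleware itself, so dropping them breaks the potential infinite loop between the two databases.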
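The filtering side of this scheme can be sketched as below. This is a minimal illustration, not the middleware's actual code: the event dictionaries, the `drop_colored` function and the `tb_order_*` table names are hypothetical, and the sketch assumes events arrive as one ordered stream in which transactions are not interleaved.

```python
from typing import Iterable, Iterator

MARKER_TABLE = "tb_transaction"

def drop_colored(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only events that were not written by the sync middleware."""
    colored = False
    for event in events:
        if event["table"] == MARKER_TABLE:
            # Marker rows only toggle the flag; they are never replicated.
            colored = event["after"]["status"] == 1
            continue
        if not colored:
            yield event

stream = [
    {"table": "tb_order_0001", "op": "insert", "after": {"id": 1}},      # business write
    {"table": "tb_transaction", "op": "update", "after": {"status": 1}},
    {"table": "tb_order_0001", "op": "update", "after": {"id": 1}},      # middleware write
    {"table": "tb_transaction", "op": "update", "after": {"status": 0}},
    {"table": "tb_order_0002", "op": "insert", "after": {"id": 2}},      # business write
]
kept = list(drop_colored(stream))
print([(e["table"], e["op"]) for e in kept])
```

Only the two business inserts survive; the update made inside the colored window is dropped. A production version would presumably track the flag per transaction rather than globally.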
Data Validation – A data‑check service compares rows between old and new databases (full and incremental checks). Full validation checks each row in both directions; incremental validation runs every five minutes on recent changes and performs multi‑stage diff checks, alerting on persistent mismatches.
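A bidirectional comparison with a second‑stage recheck might look like the following sketch. It assumes rows are already loaded into dictionaries keyed by the unique business ID; `diff_rows` and `persistent_mismatches` are illustrative names, not the data‑check service's API.

```python
def diff_rows(old_rows: dict, new_rows: dict) -> dict:
    """Compare two {business_id: row} snapshots in both directions."""
    return {
        "missing_in_new": sorted(k for k in old_rows if k not in new_rows),
        "missing_in_old": sorted(k for k in new_rows if k not in old_rows),
        "mismatched": sorted(
            k for k in old_rows
            if k in new_rows and old_rows[k] != new_rows[k]
        ),
    }

def persistent_mismatches(first_pass: list, second_pass: list) -> list:
    """Keep only IDs that still differ on a later recheck (filters sync lag)."""
    return sorted(set(first_pass) & set(second_pass))

old = {101: {"status": "PAID"}, 102: {"status": "NEW"}, 103: {"status": "DONE"}}
new = {101: {"status": "PAID"}, 102: {"status": "PAID"}, 104: {"status": "NEW"}}

report = diff_rows(old, new)
print(report)
print(persistent_mismatches([102, 103], [103]))
```

Rechecking before alerting matters because a row that is mid‑flight in the sync pipeline looks like a mismatch on the first pass but resolves on its own.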
Data Repair – Two approaches: (1) for large inconsistencies, reset Kafka offsets and replay; (2) for small inconsistencies, parse SLS logs from the data‑check module to generate corrective SQL statements.
Gray‑Rollout and ABC Verification – The migration uses a gray‑rollout strategy based on Service Provider (SP) traffic split and user‑ID modulo. Before the production cut‑over, an “ABC verification” is performed, with A being the existing production cluster: purchase two new DB clusters (B and C), sync A→B for validation, then rehearse the cut‑over between B and C before setting up the final A↔C bidirectional sync.
The cut‑over steps include stopping writes, ensuring all data is synchronized, switching the data source, and finally resuming writes. The code intercepts MyBatis requests, uses the last four digits of the user ID to decide the target DB, and respects ACM‑configured gray percentages.
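The routing decision can be sketched as follows. This assumes the user ID is a numeric string and that the gray percentage has already been read from configuration; `routes_to_new_db` is a hypothetical helper, and the real MyBatis interceptor and ACM lookup are not shown in the source.

```python
def routes_to_new_db(user_id: str, gray_percent: float) -> bool:
    """Decide whether this user's requests go to the new database.

    The last four digits give a bucket in 0..9999, so 1% of traffic
    corresponds to buckets 0..99.
    """
    bucket = int(user_id[-4:])
    return bucket < gray_percent * 100

print(routes_to_new_db("88120042", 1))    # bucket 42
print(routes_to_new_db("88129042", 1))    # bucket 9042
print(routes_to_new_db("88129042", 100))  # full rollout
```

Because the bucket depends only on the user ID, a given user stays on the same side of the split as the percentage is raised, which keeps that user's reads and writes consistent during the rollout.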
Conclusion – Over two months the team completed the sharding expansion and data migration without intrusive changes to business code and without incidents, highlighting the importance of thorough pre‑migration analysis, robust sync mechanisms, and comprehensive validation.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
