How to Migrate 1 Billion Records Efficiently: Strategies, Code, and Pitfalls
This article shares a step‑by‑step guide for migrating billions of rows safely and quickly, covering divide‑and‑conquer batching, dual‑write architectures, tool selection, shadow testing, and rollback plans, with concrete Java and Spark code examples and practical pitfalls to avoid.
In a financial system migration project, an 8‑hour user‑data sync plan stalled, and a full‑table scan SELECT * FROM users overloaded the source database, causing an 8‑hour outage.
1. Divide and Conquer
Data migration is compared to eating a ten‑layer cake: you must cut it into small pieces.
Pitfall example: thread‑pool abuse
A team used 100 concurrent threads to insert into the new database, leading to frequent deadlocks caused by primary‑key conflicts. Batch processing must balance order and randomness.
Pagination migration template code:
long maxId = 0;
int batchSize = 1000;
while (true) {
List<User> users = jdbcTemplate.query(
"SELECT * FROM users WHERE id > ? ORDER BY id LIMIT ?",
new BeanPropertyRowMapper<>(User.class),
maxId, batchSize);
if (users.isEmpty()) {
break;
}
// batch insert into new DB (disable auto‑commit)
jdbcTemplate.batchUpdate(
"INSERT INTO new_users VALUES (?,?,?)",
users.stream()
.map(u -> new Object[]{u.id, u.name, u.email})
.collect(Collectors.toList())
);
maxId = users.get(users.size() - 1).getId();
}Pitfall guidelines:
Fetch each batch by incremental ID instead of OFFSET to avoid slowing scans.
Adjust batch size dynamically based on the target DB's write capacity (500‑5000 rows per batch).
2. Dual‑Write
For billions of rows, a pure stop‑the‑world migration is unacceptable; dual‑write is the preferred approach.
Three levels of dual‑write
Bronze : Stop writes to the old DB, migrate data, then start the new DB (uncontrollable downtime).
Gold : Synchronous dual‑write + full migration, then diff comparison and traffic cut‑over (zero data loss).
King : Reverse sync (new → old) as a safety net for post‑cut‑over anomalies.
Dual‑write can be synchronous (better real‑time, poorer performance) or asynchronous (better performance, weaker real‑time). This article adopts asynchronous dual‑write.
Asynchronous dual‑write architecture:
Core implementation
Enable dual‑write switch
@Transactional
public void createUser(User user) {
// write to old DB
oldUserRepo.save(user);
// async write to new DB (allow delay)
executor.submit(() -> {
try {
newUserRepo.save(user);
} catch (Exception e) {
log.error("New DB write failed: {}", user.getId());
retryQueue.add(user);
}
});
}Periodic diff verification
// Daily diff check at 3 AM
@Scheduled(cron = "0 0 3 * * ?")
public void checkDiff() {
long maxOldId = oldUserRepo.findMaxId();
long maxNewId = newUserRepo.findMaxId();
if (maxOldId != maxNewId) {
log.warn("Primary key max mismatch, old={} vs new={}", maxOldId, maxNewId);
repairService.fixData();
}
}3. Use the Right Tools
Different scenarios require different toolchains, just as moving furniture needs a truck and delicate items need parcels.
Tool selection overview
mysqldump – suitable for small tables; not recommended for billions of rows.
MySQL Shell – InnoDB parallel export; ~2‑4 hours.
DataX – heterogeneous source migration; speed depends on resources.
Spark – cross‑cluster large‑scale ETL; 30 minutes‑2 hours.
Spark migration core snippet:
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://source:3306/db")
.option("dbtable", "users")
.option("partitionColumn", "id")
.option("numPartitions", 100) // split by primary key
.load()
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:mysql://target:3306/db")
.option("dbtable", "new_users")
.mode(SaveMode.Append)
.save()Pitfall experience:
Number of partitions should match the number of Spark executor cores; too many partitions reduce efficiency.
Partition column must be indexed to avoid full‑table scans.
4. Shadow Testing
After migration, data consistency is verified like a pre‑flight simulation.
Shadow‑DB verification flow
Production traffic writes to both old and new databases (shadow DB).
Compare data consistency using a mix of sampling and full‑volume checks.
Validate new‑DB query latency (TP99/TP95).
Automated comparison script example:
def check_row_count(old_conn, new_conn):
old_cnt = old_conn.execute("SELECT COUNT(*) FROM users").scalar()
new_cnt = new_conn.execute("SELECT COUNT(*) FROM new_users").scalar()
assert old_cnt == new_cnt, f"Row count mismatch: old={old_cnt}, new={new_cnt}"
def check_data_sample(old_conn, new_conn):
sample_ids = old_conn.execute("SELECT id FROM users TABLESAMPLE BERNOULLI(0.1)").fetchall()
for id in sample_ids:
old_row = old_conn.execute(f"SELECT * FROM users WHERE id = {id}").fetchone()
new_row = new_conn.execute(f"SELECT * FROM new_users WHERE id = {id}").fetchone()
assert old_row == new_row, f"Data mismatch, id={id}"5. Rollback
Even with thorough preparation, a rollback plan is essential—migration is like skydiving, you need a backup parachute.
Rollback plan key points:
Backup snapshot : Full physical backup + binlog position before migration.
Traffic cut‑over : Prepare routing configuration for second‑level switch back to the old DB.
Data tagging : Mark new‑DB rows for easy cleanup of dirty data.
Quick rollback script:
# Restore old DB
mysql -h old-db < backup.sql
# Apply binlog increments
mysqlbinlog --start-position=154 ./binlog.000001 | mysql -h old-db
# Switch DNS resolution
aws route53 change-resource-record-sets --cli-input-json file://switch_to_old.jsonConclusion
Key takeaways for handling 1 billion rows:
Divide and conquer : Breaking the problem down is more important than solving it whole.
Iterative rollout : Gradually increase traffic with gray‑scale verification.
Guard the baseline : Always have a rehearsed rollback plan.
There is no 100 % guaranteed migration; only a 100 % prepared Plan B.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
