Databases 10 min read

How to Migrate 1 Billion Records Efficiently: Strategies, Code, and Pitfalls

This article shares a step‑by‑step guide for migrating billions of rows safely and quickly, covering divide‑and‑conquer batching, dual‑write architectures, tool selection, shadow testing, and rollback plans, with concrete Java and Spark code examples and practical pitfalls to avoid.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
How to Migrate 1 Billion Records Efficiently: Strategies, Code, and Pitfalls

In a financial system migration project, an 8‑hour user‑data sync plan stalled, and a full‑table scan SELECT * FROM users overloaded the source database, causing an 8‑hour outage.

1. Divide and Conquer

Data migration is compared to eating a ten‑layer cake: you must cut it into small pieces.

Pitfall example: thread‑pool abuse

A team used 100 concurrent threads to insert into the new database, leading to frequent deadlocks caused by primary‑key conflicts. Batch processing must balance order and randomness.

Pagination migration template code:

long maxId = 0;
int batchSize = 1000;
while (true) {
    List<User> users = jdbcTemplate.query(
        "SELECT * FROM users WHERE id > ? ORDER BY id LIMIT ?",
        new BeanPropertyRowMapper<>(User.class),
        maxId, batchSize);
    if (users.isEmpty()) {
        break;
    }
    // batch insert into new DB (disable auto‑commit)
    jdbcTemplate.batchUpdate(
        "INSERT INTO new_users VALUES (?,?,?)",
        users.stream()
            .map(u -> new Object[]{u.id, u.name, u.email})
            .collect(Collectors.toList())
    );
    maxId = users.get(users.size() - 1).getId();
}

Pitfall guidelines:

Fetch each batch by incremental ID instead of OFFSET to avoid slowing scans.

Adjust batch size dynamically based on the target DB's write capacity (500‑5000 rows per batch).

2. Dual‑Write

For billions of rows, a pure stop‑the‑world migration is unacceptable; dual‑write is the preferred approach.

Three levels of dual‑write

Bronze : Stop writes to the old DB, migrate data, then start the new DB (uncontrollable downtime).

Gold : Synchronous dual‑write + full migration, then diff comparison and traffic cut‑over (zero data loss).

King : Reverse sync (new → old) as a safety net for post‑cut‑over anomalies.

Dual‑write can be synchronous (better real‑time, poorer performance) or asynchronous (better performance, weaker real‑time). This article adopts asynchronous dual‑write.

Asynchronous dual‑write architecture:

Core implementation

Enable dual‑write switch

@Transactional
public void createUser(User user) {
    // write to old DB
    oldUserRepo.save(user);
    // async write to new DB (allow delay)
    executor.submit(() -> {
        try {
            newUserRepo.save(user);
        } catch (Exception e) {
            log.error("New DB write failed: {}", user.getId());
            retryQueue.add(user);
        }
    });
}

Periodic diff verification

// Daily diff check at 3 AM
@Scheduled(cron = "0 0 3 * * ?")
public void checkDiff() {
    long maxOldId = oldUserRepo.findMaxId();
    long maxNewId = newUserRepo.findMaxId();
    if (maxOldId != maxNewId) {
        log.warn("Primary key max mismatch, old={} vs new={}", maxOldId, maxNewId);
        repairService.fixData();
    }
}

3. Use the Right Tools

Different scenarios require different toolchains, just as moving furniture needs a truck and delicate items need parcels.

Tool selection overview

mysqldump – suitable for small tables; not recommended for billions of rows.

MySQL Shell – InnoDB parallel export; ~2‑4 hours.

DataX – heterogeneous source migration; speed depends on resources.

Spark – cross‑cluster large‑scale ETL; 30 minutes‑2 hours.

Spark migration core snippet:

val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://source:3306/db")
    .option("dbtable", "users")
    .option("partitionColumn", "id")
    .option("numPartitions", 100) // split by primary key
    .load()

jdbcDF.write
    .format("jdbc")
    .option("url", "jdbc:mysql://target:3306/db")
    .option("dbtable", "new_users")
    .mode(SaveMode.Append)
    .save()

Pitfall experience:

Number of partitions should match the number of Spark executor cores; too many partitions reduce efficiency.

Partition column must be indexed to avoid full‑table scans.

4. Shadow Testing

After migration, data consistency is verified like a pre‑flight simulation.

Shadow‑DB verification flow

Production traffic writes to both old and new databases (shadow DB).

Compare data consistency using a mix of sampling and full‑volume checks.

Validate new‑DB query latency (TP99/TP95).

Automated comparison script example:

def check_row_count(old_conn, new_conn):
    old_cnt = old_conn.execute("SELECT COUNT(*) FROM users").scalar()
    new_cnt = new_conn.execute("SELECT COUNT(*) FROM new_users").scalar()
    assert old_cnt == new_cnt, f"Row count mismatch: old={old_cnt}, new={new_cnt}"

def check_data_sample(old_conn, new_conn):
    sample_ids = old_conn.execute("SELECT id FROM users TABLESAMPLE BERNOULLI(0.1)").fetchall()
    for id in sample_ids:
        old_row = old_conn.execute(f"SELECT * FROM users WHERE id = {id}").fetchone()
        new_row = new_conn.execute(f"SELECT * FROM new_users WHERE id = {id}").fetchone()
        assert old_row == new_row, f"Data mismatch, id={id}"

5. Rollback

Even with thorough preparation, a rollback plan is essential—migration is like skydiving, you need a backup parachute.

Rollback plan key points:

Backup snapshot : Full physical backup + binlog position before migration.

Traffic cut‑over : Prepare routing configuration for second‑level switch back to the old DB.

Data tagging : Mark new‑DB rows for easy cleanup of dirty data.

Quick rollback script:

# Restore old DB
mysql -h old-db < backup.sql

# Apply binlog increments
mysqlbinlog --start-position=154 ./binlog.000001 | mysql -h old-db

# Switch DNS resolution
aws route53 change-resource-record-sets --cli-input-json file://switch_to_old.json

Conclusion

Key takeaways for handling 1 billion rows:

Divide and conquer : Breaking the problem down is more important than solving it whole.

Iterative rollout : Gradually increase traffic with gray‑scale verification.

Guard the baseline : Always have a rehearsed rollback plan.

There is no 100 % guaranteed migration; only a 100 % prepared Plan B.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data MigrationBig DatadatabasesSpark
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.