How to Import 1 Billion Records into MySQL Efficiently: Design, Sharding, and Performance Tips
This article explains how to import 1 billion 1 KB log records stored in HDFS/S3 into MySQL by analyzing single‑table limits, proposing sharding into multiple tables, choosing the right storage engine, using batch inserts, controlling concurrency, and ensuring ordered, reliable processing with Redis and Redisson.
Problem Statement and Constraints
We need to import 1 billion rows (≈1 KB each) of unstructured log data stored in HDFS or S3. The data is already split into roughly 100 files; each file is internally ordered and must be written to MySQL while preserving order and avoiding duplicates.
Each row is 1 KB.
Data is unstructured log that must be parsed before insertion.
Source resides in HDFS or S3.
Files are split into 100 parts with sequential suffixes.
Import must be ordered and preferably deduplicated.
Target database is MySQL.
Can a Single MySQL Table Hold 1 Billion Rows?
Not efficiently. A practical limit for a single InnoDB table is around 20 million rows (≈2 × 10⁷). The limit comes from the B+‑tree index depth: with a 16 KB leaf page and a 1 KB row, each leaf holds 16 rows; a non‑leaf page (also 16 KB) can hold about 1 170 child pointers (16 KB / 14 B per entry, assuming an 8 B bigint key plus a 6 B page pointer). A three‑level tree therefore addresses about 1 170 × 1 170 × 16 ≈ 21.9 million rows; beyond three index levels, every lookup costs an extra page read and performance degrades sharply.
Therefore we design the schema around 10 million rows per table ("1 KW", i.e., 千万 in the original), giving 1 billion / 10 million = 100 tables for the full dataset.
Efficient Writing Strategies
Single‑row inserts are slow. Use batch inserts (e.g., 100 rows per batch). InnoDB guarantees atomicity for a batch within a transaction, so either all rows are written or none.
Retry failed batches; after N retries, fall back to single‑row inserts, log the failures, and discard if necessary.
Write rows in primary‑key order; avoid non‑primary indexes during the load. Create secondary indexes after the bulk load completes.
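A minimal sketch of batched insertion over JDBC, with a pure helper that splits rows into fixed‑size batches (table and column names such as log_table, id, and content are hypothetical; batch size and retry policy would come from configuration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchInserter {

    // Split rows into fixed-size batches (pure helper, easy to unit test).
    public static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            batches.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return batches;
    }

    // Insert one batch inside a single transaction: all rows commit or none.
    // "log_table", "id" and "content" are illustrative names.
    public static void insertBatch(Connection conn, List<String[]> batch) throws SQLException {
        String sql = "INSERT INTO log_table (id, content) VALUES (?, ?)";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : batch) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // caller retries, then falls back to single-row inserts
            throw e;
        }
    }
}
```

On a batch failure the transaction is rolled back, so the caller can safely retry the whole batch without creating partial writes.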
Should We Write Concurrently to the Same Table?
Do not. Concurrent writes to the same table break ordering and cause index contention. Instead, increase the batch size to achieve comparable throughput with a single writer per table.
Choosing the MySQL Storage Engine
MyISAM offers higher raw insert speed but lacks transaction support, making it unsuitable for reliable bulk loads. InnoDB with innodb_flush_log_at_trx_commit set to 0 or 2 can approach MyISAM performance while still providing ACID guarantees (with a possible 1‑second data loss window on a crash).
Do We Need Database Sharding?
Single‑instance MySQL typically caps at ~5 K TPS. With SSDs the limit is higher; HDDs suffer from head contention when multiple tables are written concurrently. Therefore we design a flexible sharding scheme that can configure the number of databases and the number of tables written in parallel.
Configurable number of databases.
Configurable number of tables per database (e.g., limit to one table per DB on HDD).
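A minimal sketch of such a configuration object (all names are hypothetical; the actual values would be tuned per deployment and storage type):

```java
public class ShardingConfig {
    public final int databaseCount;      // number of MySQL databases
    public final int tablesPerDatabase;  // tables written in parallel per DB

    public ShardingConfig(int databaseCount, int tablesPerDatabase) {
        if (databaseCount <= 0 || tablesPerDatabase <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.databaseCount = databaseCount;
        this.tablesPerDatabase = tablesPerDatabase;
    }

    // On HDD, restrict each database to one concurrently written table
    // to avoid disk-head contention.
    public static ShardingConfig forHdd(int databaseCount) {
        return new ShardingConfig(databaseCount, 1);
    }

    // On SSD, several tables per database can be written in parallel.
    public static ShardingConfig forSsd(int databaseCount, int tablesPerDatabase) {
        return new ShardingConfig(databaseCount, tablesPerDatabase);
    }
}
```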
Ensuring Ordered Writes
Use a composite primary key built from the file suffix and the line number (e.g., index_90.txt → database_9.table_0). This guarantees that rows from the same file go to the same table and that the global order can be reconstructed from the table and file identifiers.
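With 100 files (suffixes 00–99) mapped onto 10 databases × 10 tables, the routing from the example (index_90.txt → database_9.table_0) can be sketched as below; the digit‑based layout and the key format are illustrative assumptions:

```java
public class ShardRouter {

    // Map a file suffix (0–99) to "database_<d>.table_<t>":
    // the tens digit selects the database, the ones digit the table.
    public static String route(int fileSuffix) {
        if (fileSuffix < 0 || fileSuffix > 99) {
            throw new IllegalArgumentException("suffix out of range: " + fileSuffix);
        }
        int db = fileSuffix / 10;
        int table = fileSuffix % 10;
        return "database_" + db + ".table_" + table;
    }

    // Composite primary key: file suffix + line number. Sorting on this key
    // reconstructs the global order, and re-inserting a line is idempotent.
    public static String primaryKey(int fileSuffix, long lineNumber) {
        return String.format("%02d_%010d", fileSuffix, lineNumber);
    }
}
```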
Fast File Reading
Reading a 10 GB file cannot be done in one shot. Options include:
Files.readAllBytes – loads the entire file into memory (not feasible for 10 GB).
FileReader + BufferedReader – buffered line‑by‑line reading.
Scanner – line‑by‑line reading.
Java NIO FileChannel with a fixed‑size ByteBuffer.
Benchmarks on macOS show FileChannel is fastest, but it reads fixed‑size byte chunks and therefore does not preserve line boundaries. For simplicity and acceptable performance (≈30 s for a 10 GB file), BufferedReader is chosen.
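A minimal sketch of the chosen BufferedReader approach, flushing every batchSize lines to a sink (the sink stands in for the batch insert; all names are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedLineReader {

    // Read lines and hand them to the sink in batches of batchSize.
    // Returns the total number of lines read.
    public static long readInBatches(BufferedReader reader, int batchSize,
                                     Consumer<List<String>> sink) throws IOException {
        List<String> batch = new ArrayList<>(batchSize);
        long total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            total++;
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // flush a full batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final partial batch
        }
        return total;
    }
}
```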
The FileChannel timing benchmark (reading in 1 MB chunks):

File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {
    int capacity = 1 * 1024 * 1024; // 1 MB buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    int read;
    // read() returns the number of bytes actually read, or -1 at end of file
    while ((read = fileChannel.read(byteBuffer)) != -1) {
        size += read;
        byteBuffer.clear(); // reset the buffer for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) { // FileNotFoundException is a subclass of IOException
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));

Task Reliability and Progress Tracking
If a reader crashes or the DB becomes unavailable, we must resume without duplication. The primary key (file suffix + line number) provides idempotency.
Use Redis to store the current offset for each task with INCRBY KEY_NAME INCR_AMOUNT, e.g., INCRBY task_offset_5 100. On batch failure, retry; after repeated failures, fall back to single‑row inserts and update Redis accordingly. Optionally, consume the MySQL binlog to keep Redis in sync with what was actually committed.
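The resume logic can be sketched with an in‑memory map standing in for Redis (in production, INCRBY on task_offset_<n> replaces the map; all class and method names here are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetTracker {

    // Stand-in for Redis: key "task_offset_<taskId>" -> lines already committed.
    private final Map<String, Long> store = new HashMap<>();

    public static String offsetKey(int taskId) {
        return "task_offset_" + taskId;
    }

    public long getOffset(int taskId) {
        return store.getOrDefault(offsetKey(taskId), 0L);
    }

    // Equivalent of INCRBY, called after a batch commits.
    public long advance(int taskId, long committedLines) {
        return store.merge(offsetKey(taskId), committedLines, Long::sum);
    }

    // On restart, skip lines the offset says were already written; the
    // (file suffix, line number) primary key makes any duplicate inserts
    // fail harmlessly even if the offset lags behind the database.
    public List<String> remaining(int taskId, List<String> lines) {
        int done = (int) Math.min(getOffset(taskId), lines.size());
        return lines.subList(done, lines.size());
    }
}
```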
Coordinating Task Concurrency
Each node polls a task table and attempts to acquire a Redisson semaphore keyed by the database ID. The semaphore limits the number of concurrent writes per DB.
Config config = new Config(); // hypothetical single-server setup
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redissonClient = Redisson.create(config);
RSemaphore rSemaphore = redissonClient.getSemaphore("semaphore");
// Set concurrency to 1
rSemaphore.trySetPermits(1);
rSemaphore.tryAcquire(); // non-blocking acquire

If a semaphore is not released because a node crashes, a timeout can be set, but Redisson does not support lease renewal for semaphores. As an alternative, elect a master node via ZooKeeper + Curator to assign tasks and hold a distributed lock that can be renewed.
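The per‑database limit can be illustrated with a JDK java.util.concurrent.Semaphore standing in for the distributed Redisson RSemaphore (same acquire/release contract, but process‑local; class and method names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class DbWriteLimiter {

    // One semaphore per database ID; permits = max concurrent writers per DB.
    private final ConcurrentHashMap<String, Semaphore> semaphores = new ConcurrentHashMap<>();
    private final int permitsPerDb;

    public DbWriteLimiter(int permitsPerDb) {
        this.permitsPerDb = permitsPerDb;
    }

    // Non-blocking attempt to become a writer for the given database.
    public boolean tryAcquire(String dbId) {
        return semaphores
                .computeIfAbsent(dbId, id -> new Semaphore(permitsPerDb))
                .tryAcquire();
    }

    // Release the permit once the task finishes (or fails).
    public void release(String dbId) {
        Semaphore s = semaphores.get(dbId);
        if (s != null) {
            s.release();
        }
    }
}
```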
Overall Design Summary
Clarify interview constraints before designing the solution.
Split the 1 billion rows into multiple databases/tables (≈100 tables) to avoid single‑table limits.
Use batch inserts with configurable size; test thresholds (100, 1 000, 10 000) to find the optimal value.
Prefer InnoDB with innodb_flush_log_at_trx_commit set to 0/2 for speed, but fall back to MyISAM if the cluster forbids changing the setting.
Limit per‑database concurrent writes; adjust dynamically based on SSD/HDD characteristics.
Track task progress in Redis to guarantee idempotent retries.
Combine reading and writing in the same task to avoid complex Kafka pipelines.
Use a master node (Zookeeper+Curator) to coordinate task assignment and avoid semaphore timeout issues.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn