
How to Import 1 Billion Records into MySQL at Lightning Speed

This guide explains how to efficiently load one billion 1 KB log entries from HDFS or S3 into MySQL: estimating B+ tree capacity, using batch inserts, choosing the storage engine, sharding tables, reading files efficiently, and coordinating tasks with Redis, Redisson, and Zookeeper.

ITPUB

Understanding the Constraints

Before designing a solution, pin down the data characteristics. There are 1 billion rows of about 1 KB each (roughly 1 TB in total), stored as unstructured logs in HDFS or S3 and split into roughly 100 files. The import into MySQL must preserve row order and must not produce duplicates.

Can a Single MySQL Table Hold 1 Billion Rows?

MySQL's primary-key index (InnoDB) is a B+ tree. With a 16 KB page and 1 KB rows, a leaf page holds about 16 rows. A 16 KB non-leaf page stores about 1 170 child pointers, since each entry takes roughly 14 bytes (8-byte BIGINT key + 6-byte pointer) and 16 384 / 14 ≈ 1 170. This yields the following approximate capacity:

2‑level index: ~18 720 rows

3-level index: ~21 902 400 rows (≈22 million)

4-level index: ~25 625 808 000 rows (≈25.6 billion)

A 3-level index tops out near 22 million rows, so a single table holding 1 billion rows needs a fourth level, and every primary-key lookup or insert then traverses one more page. To keep each table within three levels, split the data across multiple tables, for example 10 million rows per table, which gives about 100 tables.
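The capacity figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (16 KB pages, 1 KB rows, 14-byte index entries); the class and method names are illustrative. It ignores page headers and fill factor, so real capacities will be somewhat lower.

```java
/** Rough B+ tree capacity estimate using the figures quoted in the text. */
final class BTreeCapacity {
    static final long PAGE_BYTES = 16 * 1024; // page size assumed in the text
    static final long ROW_BYTES = 1024;       // about 1 KB per log row
    static final long ENTRY_BYTES = 8 + 6;    // BIGINT key + child pointer

    /** Rows addressable by an index with the given number of levels (>= 1). */
    static long maxRows(int levels) {
        long rowsPerLeaf = PAGE_BYTES / ROW_BYTES; // 16
        long fanout = PAGE_BYTES / ENTRY_BYTES;    // 1170 (integer division)
        long rows = rowsPerLeaf;
        for (int i = 1; i < levels; i++) {
            rows *= fanout; // each extra level multiplies capacity by the fanout
        }
        return rows;
    }

    public static void main(String[] args) {
        for (int levels = 2; levels <= 4; levels++) {
            System.out.println(levels + " levels: " + maxRows(levels) + " rows");
        }
    }
}
```

Because each extra level multiplies capacity by the fanout, the practical question is how many rows fit before a fourth level is needed, not whether 1 billion rows fit at all.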

Efficient Write Strategies

Individual inserts are slow; batch inserts dramatically improve throughput. A reasonable starting batch size is 100 rows, adjustable based on testing.
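One common way to batch is a single multi-row INSERT executed through a JDBC PreparedStatement. The sketch below only builds the SQL text; the table name and the `id` and `payload` columns are hypothetical, since the article does not give a schema.

```java
/** Builds a parameterized multi-row INSERT for use with a JDBC PreparedStatement. */
final class MultiRowInsert {
    /**
     * Returns SQL with one "(?, ?)" group per row.
     * The column names id and payload are assumptions for illustration.
     */
    static String sql(String table, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        StringBuilder sb = new StringBuilder("INSERT INTO ")
                .append(table)
                .append(" (id, payload) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("(?, ?)");
        }
        return sb.toString();
    }
}
```

With a batch size of 100, the statement carries 200 bind parameters, and the caller sets them in row order before executing it once.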

Use InnoDB transactions to guarantee atomic batch writes. If a batch fails after N retries, fall back to inserting rows individually and log failures.

Write rows in primary‑key order to keep inserts sequential. Avoid non‑primary indexes during bulk load; create them after data is loaded.
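The retry-then-fall-back policy described above might look like the following sketch. `Sink` is a hypothetical interface standing in for "write these rows in one InnoDB transaction"; a real implementation would run the batch INSERT over JDBC and roll back on error. The `failedRows` list stands in for a failure log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Writes rows in ordered batches, retries a failed batch, then falls back to single rows. */
final class BatchWriter {
    /** Hypothetical sink: writes all given rows atomically or throws. */
    interface Sink {
        void writeAll(List<String> rows) throws Exception;
    }

    private final Sink sink;
    private final int maxRetries;
    final List<String> failedRows = new ArrayList<>(); // stand-in for a failure log

    BatchWriter(Sink sink, int maxRetries) {
        this.sink = sink;
        this.maxRetries = maxRetries;
    }

    /** Writes rows in their given order, batchSize rows at a time. */
    void write(List<String> rows, int batchSize) {
        for (int from = 0; from < rows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, rows.size());
            writeBatch(rows.subList(from, to));
        }
    }

    private void writeBatch(List<String> batch) {
        // One initial attempt plus maxRetries retries of the whole batch.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                sink.writeAll(batch);
                return;
            } catch (Exception e) {
                // Fall through and retry; a real loader would log and back off here.
            }
        }
        // Fallback: insert rows one by one so a single bad row cannot block the rest.
        for (String row : batch) {
            try {
                sink.writeAll(Collections.singletonList(row));
            } catch (Exception e) {
                failedRows.add(row);
            }
        }
    }
}
```

The fallback keeps rows in order and isolates the failing rows, at the cost of slow single-row writes for that one batch.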

Should You Write Concurrently to the Same Table?

Concurrent writes to a single table break the ordering guarantee and cause contention on the index tree. Instead, keep a single writer per table and raise the batch size to increase throughput.

Choosing the MySQL Storage Engine

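MyISAM offers higher raw insert speed but has no transactions, so a failed batch cannot be rolled back, which makes it unsuitable for reliable batch loads. InnoDB with innodb_flush_log_at_trx_commit set to 0 or 2 offers a better balance: the redo log is flushed to disk about once per second instead of at every commit, which speeds up writes but can lose up to roughly a second of committed transactions if the server or host crashes. Change this setting only if the production environment permits it.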

Sharding and Table Distribution

With SSD storage, a single database can handle roughly 5,000 TPS. With HDD storage, concurrent writes compete for a single disk head, so write concurrency must be kept low. Because the right values depend on the hardware, make the number of databases and the number of tables per database configurable so they can be tuned at runtime.
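A configurable layout can be as simple as two integers plus a routing function. The sketch below assumes tables are numbered globally and laid out contiguously across databases; the `db_NN.log_NNNN` naming scheme is hypothetical.

```java
import java.util.Locale;

/** Maps a global table index onto a configurable (databases x tables) layout. */
final class ShardRouter {
    private final int databases;
    private final int tablesPerDatabase;

    ShardRouter(int databases, int tablesPerDatabase) {
        if (databases <= 0 || tablesPerDatabase <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.databases = databases;
        this.tablesPerDatabase = tablesPerDatabase;
    }

    int totalTables() {
        return databases * tablesPerDatabase;
    }

    /** Index of the database that owns the given table. */
    int databaseOf(int tableIndex) {
        if (tableIndex < 0 || tableIndex >= totalTables()) {
            throw new IllegalArgumentException("table index out of range: " + tableIndex);
        }
        return tableIndex / tablesPerDatabase;
    }

    /** Hypothetical naming scheme, e.g. "db_03.log_0037". */
    String qualifiedName(int tableIndex) {
        return String.format(Locale.ROOT, "db_%02d.log_%04d", databaseOf(tableIndex), tableIndex);
    }
}
```

For example, `new ShardRouter(10, 10)` spreads 100 tables over 10 databases, while `new ShardRouter(1, 100)` keeps them all in one. Only the two constructor arguments change when the hardware does.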

Optimizing File Reading

Each file is about 10 GB, so loading it whole with Files.readAllBytes is not an option: it fails with an OutOfMemoryError. Benchmarks of streaming approaches on macOS show:

FileReader + BufferedReader: ~11 s

File + BufferedReader: ~10 s

Scanner: ~57 s

Java NIO FileChannel with buffer: ~3 s

Because the overall bottleneck is database insertion, spending around 30 seconds to read a file line by line is acceptable. The final design uses BufferedReader for its line-oriented API and adequate performance.
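A minimal sketch of the line-by-line approach follows. It streams a file through BufferedReader and hands fixed-size batches to a callback, so memory use stays bounded by the batch size. The class name, method names, and the UTF-8 assumption are illustrative, not taken from the article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Streams a large text file line by line and emits fixed-size batches. */
final class LineBatchReader {
    /** Opens the file (assumed UTF-8) and streams it; returns the number of lines read. */
    static long readInBatches(Path file, int batchSize, Consumer<List<String>> onBatch) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return readInBatches(reader, batchSize, onBatch);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same as above for an already-open reader; the caller closes the reader. */
    static long readInBatches(BufferedReader reader, int batchSize, Consumer<List<String>> onBatch) {
        List<String> batch = new ArrayList<>(batchSize);
        long lines = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                lines++;
                if (batch.size() == batchSize) {
                    onBatch.accept(batch);
                    batch = new ArrayList<>(batchSize); // fresh list so the callback may keep the old one
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!batch.isEmpty()) {
            onBatch.accept(batch); // final partial batch
        }
        return lines;
    }
}
```

The callback is where the database write would go, which keeps reading and writing in the same task.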

Task Coordination and Reliability

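Each record's primary key can be composed from the file index and line number, {fileIndex}{lineNumber}, which makes it unique and makes re-inserting the same line idempotent. The two parts need a separator or fixed-width padding; otherwise file 1, line 23 and file 12, line 3 both yield 123. If multiple import jobs run, prepend a task ID (e.g., {taskId}_{fileIndex}_{lineNumber}). Store each task's progress in Redis, advancing the offset with INCRBY after every committed batch. On failure, retry the batch; after repeated failures, insert the rows individually and update the Redis offset accordingly.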

When a task crashes, read the stored offset from Redis and resume from that line, avoiding duplicate inserts.
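The key scheme and the resume logic can be sketched as follows. To stay self-contained, the example uses an in-memory map as a stand-in for Redis; `advance` mirrors what an INCRBY on a per-file counter would do. The counter key format and all names are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks per-file import progress; the map is an in-memory stand-in for Redis. */
final class ImportProgress {
    // Key: "{taskId}_{fileIndex}" (assumed format); value: number of lines committed so far.
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();

    /** Primary key in the {taskId}_{fileIndex}_{lineNumber} form described in the text. */
    static String rowKey(String taskId, int fileIndex, long lineNumber) {
        return taskId + "_" + fileIndex + "_" + lineNumber;
    }

    /** Mirrors INCRBY: adds the rows just committed and returns the new total. */
    long advance(String taskId, int fileIndex, long committedRows) {
        return offsets.merge(taskId + "_" + fileIndex, committedRows, Long::sum);
    }

    /** 1-based line number to resume from after a restart. */
    long resumeLine(String taskId, int fileIndex) {
        return offsets.getOrDefault(taskId + "_" + fileIndex, 0L) + 1;
    }
}
```

If a crash happens after a batch commits but before the offset is advanced, the resumed task replays that batch. The deterministic primary key is what makes the replay detectable as a duplicate rather than a second copy of the data.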

Controlling Concurrency

Limit the number of simultaneous write tasks per database using Redisson semaphores (permits = 1). Each worker acquires a permit before processing and releases it after completion. To handle semaphore leaks (e.g., process crash), set a lease timeout and implement renewal logic.
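To make the lease idea concrete, here is an in-process model of a single permit with a lease timeout and renewal. It is not Redisson's API: a real deployment would keep this state in Redis through Redisson, and the class and method names here are hypothetical. Time is passed in explicitly so the behavior is easy to reason about.

```java
/** In-process model of one permit guarded by a lease; illustrative only, not Redisson. */
final class LeasedPermit {
    private String owner;         // null when the permit is free
    private long expiresAtMillis; // lease deadline of the current owner

    /** Grants the permit if it is free or the previous owner's lease has expired. */
    synchronized boolean tryAcquire(String worker, long nowMillis, long leaseMillis) {
        if (owner != null && nowMillis < expiresAtMillis) {
            return false; // someone else holds a live lease
        }
        owner = worker;
        expiresAtMillis = nowMillis + leaseMillis;
        return true;
    }

    /** Extends the lease; fails if the caller no longer holds a live lease. */
    synchronized boolean renew(String worker, long nowMillis, long leaseMillis) {
        if (!worker.equals(owner) || nowMillis >= expiresAtMillis) {
            return false;
        }
        expiresAtMillis = nowMillis + leaseMillis;
        return true;
    }

    /** Releases the permit; only the current owner may release it. */
    synchronized boolean release(String worker) {
        if (!worker.equals(owner)) {
            return false;
        }
        owner = null;
        return true;
    }
}
```

A healthy worker calls `renew` periodically while it imports. A crashed worker stops renewing, so its lease expires and another worker can take over. A worker whose `renew` returns false must stop writing, because another worker may already hold the permit.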

If Redisson cannot renew leases, switch to a leader‑election model with Zookeeper + Curator: the leader assigns tasks, workers acquire a distributed lock for exclusive processing, and the lock is renewed until the task finishes.

Final Recommendations

Confirm all constraints (file size, format, ordering, deduplication) before design.

Split the dataset into multiple tables/databases based on B+‑tree capacity.

Use batch inserts with adjustable thresholds; test to find the optimal size.

Prefer InnoDB with tuned innodb_flush_log_at_trx_commit unless MyISAM is explicitly allowed.

Read files with BufferedReader (line‑by‑line) to balance memory usage and speed.

Combine reading and writing in the same task to avoid excessive write concurrency.

Track progress in Redis for fault‑tolerant resumability.

Coordinate workers via Redisson semaphores or Zookeeper‑based leader election to ensure exclusive access and handle timeouts.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
