How to Import 1 Billion Records into MySQL at Lightning Speed
This guide explains how to efficiently load one billion 1‑KB log entries from HDFS or S3 into MySQL by analyzing B‑tree limits, using batch inserts, choosing the right storage engine, sharding tables, optimizing file reading, and coordinating tasks with Redis, Redisson, and Zookeeper.
Understanding the Constraints
Before designing a solution, clarify the data characteristics: 1 billion rows, each about 1 KB, stored as unstructured logs in HDFS or S3, split into roughly 100 files, requiring ordered, deduplicated import into a MySQL database.
Can a Single MySQL Table Hold 1 Billion Rows?
MySQL’s primary‑key index is a B+ tree. With a leaf‑page size of 16 KB and each row 1 KB, a leaf page holds 16 rows. A non‑leaf page (16 KB) can store about 1 170 child pointers (8‑byte BigInt key + 6‑byte pointer). This yields the following capacity:
2‑level index: ~18 720 rows
3‑level index: ~21 902 400 rows (≈20 million)
4‑level index: ~25 625 808 000 rows (≈2.5 billion)
Since a 3‑level index tops out near 20 million rows, a single table cannot efficiently store 1 billion rows. The practical recommendation is to split the data into multiple tables (e.g., 1 KW per table, resulting in about 100 tables).
Efficient Write Strategies
Individual inserts are slow; batch inserts dramatically improve throughput. A reasonable starting batch size is 100 rows, adjustable based on testing.
Use InnoDB transactions to guarantee atomic batch writes. If a batch fails after N retries, fall back to inserting rows individually and log failures.
Write rows in primary‑key order to keep inserts sequential. Avoid non‑primary indexes during bulk load; create them after data is loaded.
Should You Write Concurrently to the Same Table?
Concurrent writes to a single table break ordering guarantees and cause index‑tree contention. Instead, increase the batch size to raise effective concurrency without parallel writes to the same table.
Choosing the MySQL Storage Engine
MyISAM offers higher raw insert speed but lacks transactional guarantees, making it unsuitable for reliable batch loads. InnoDB, especially with innodb_flush_log_at_trx_commit set to 0 or 2, provides a good balance of safety and performance. Adjust this setting only if the production environment permits.
Sharding and Table Distribution
For SSD storage, a single database can handle ~5 K TPS; for HDD, concurrency is limited by the single disk head. Design the system to configure the number of databases and the number of tables per database dynamically, allowing runtime tuning based on hardware.
Optimizing File Reading
Reading 10 GB files cannot be done with Files.readAllBytes (OOM). Benchmarks on macOS show:
FileReader + BufferedReader: ~11 s
File + BufferedReader: ~10 s
Scanner: ~57 s
Java NIO FileChannel with buffer: ~3 s
Because the overall bottleneck is database insertion, a 30‑second read using BufferedReader (line‑by‑line) is sufficient. The final design uses BufferedReader for its line‑oriented API and acceptable performance.
Task Coordination and Reliability
Each record’s primary key can be composed of {fileIndex}{lineNumber} to guarantee uniqueness and idempotency. If multiple import jobs run, prepend a taskId (e.g., {taskId}_{fileIndex}_{lineNumber}) and store progress in Redis using INCRBY. On failure, retry the batch; after repeated failures, insert rows individually and update Redis accordingly.
When a task crashes, read the stored offset from Redis and resume from that line, avoiding duplicate inserts.
Controlling Concurrency
Limit the number of simultaneous write tasks per database using Redisson semaphores (permits = 1). Each worker acquires a permit before processing and releases it after completion. To handle semaphore leaks (e.g., process crash), set a lease timeout and implement renewal logic.
If Redisson cannot renew leases, switch to a leader‑election model with Zookeeper + Curator: the leader assigns tasks, workers acquire a distributed lock for exclusive processing, and the lock is renewed until the task finishes.
Final Recommendations
Confirm all constraints (file size, format, ordering, deduplication) before design.
Split the dataset into multiple tables/databases based on B+‑tree capacity.
Use batch inserts with adjustable thresholds; test to find the optimal size.
Prefer InnoDB with tuned innodb_flush_log_at_trx_commit unless MyISAM is explicitly allowed.
Read files with BufferedReader (line‑by‑line) to balance memory usage and speed.
Combine reading and writing in the same task to avoid excessive write concurrency.
Track progress in Redis for fault‑tolerant resumability.
Coordinate workers via Redisson semaphores or Zookeeper‑based leader election to ensure exclusive access and handle timeouts.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
