Efficient Strategies for Importing One Billion Records into MySQL
This article explains how to import 1 billion 1 KB log records stored in HDFS or S3 into MySQL by analyzing single‑table limits, using batch inserts, choosing storage engines, sharding, optimizing file‑reading methods, and coordinating distributed tasks with Redis, Redisson, and Zookeeper to ensure ordered, reliable, and high‑throughput data loading.
Before designing a solution, clarify the constraints: 1 billion rows, each 1 KB, unstructured logs stored in HDFS or S3, split into 100 files, ordered import required, and the target database is MySQL.
Can a Single MySQL Table Hold 1 Billion Rows?
No. A single InnoDB table should be kept below roughly 20 million rows. The limit follows from B+‑tree index depth: a 3‑level index covers about 20 million 1 KB rows, while growing to a 4th level adds an extra page read per lookup and degrades performance.
Leaf pages are 16 KB, so each stores 16 rows (1 KB per row). Non‑leaf pages are also 16 KB and hold about 1,170 entries (14 bytes each: an 8‑byte bigint key plus a 6‑byte page pointer). This yields the following capacity table:
Level | Maximum Rows
2     | 18,720
3     | 21,902,400 (≈ 20 M)
4     | 25,625,808,000 (≈ 25.6 B)
Therefore, split the data into 100 tables (≈10 M rows each) to stay within safe limits.
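The capacity figures above follow from simple arithmetic. A minimal sketch, assuming the page and row sizes stated above (class and method names are illustrative):

```java
// Estimate how many rows an InnoDB B+-tree of a given height can index.
// Assumptions from the text: 16 KB pages, 1 KB rows, 14-byte non-leaf entries.
public class BTreeCapacity {
    static final long PAGE_BYTES = 16 * 1024;
    static final long ROW_BYTES = 1024;          // one 1 KB log record
    static final long NONLEAF_ENTRY_BYTES = 14;  // 8-byte bigint key + 6-byte page pointer

    static final long ROWS_PER_LEAF = PAGE_BYTES / ROW_BYTES;               // 16
    static final long POINTERS_PER_NODE = PAGE_BYTES / NONLEAF_ENTRY_BYTES; // 1170

    /** Maximum rows indexable by a tree of the given height (root included). */
    public static long maxRows(int levels) {
        long rows = ROWS_PER_LEAF;
        for (int i = 1; i < levels; i++) rows *= POINTERS_PER_NODE;
        return rows;
    }

    public static void main(String[] args) {
        for (int levels = 2; levels <= 4; levels++) {
            System.out.printf("%d levels -> %,d rows%n", levels, maxRows(levels));
        }
    }
}
```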
Efficient Writing to the Database
Single‑row inserts are too slow; use batch inserts (e.g., 100 rows per batch), which InnoDB executes atomically within a transaction. Enable retries: after N failed attempts on a batch, fall back to single‑row inserts and log the rows that still fail.
Write in primary‑key order for best performance; avoid non‑primary indexes during bulk load, or create them after the load.
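The batch-then-fallback logic can be sketched independently of JDBC. This is a minimal illustration, with the `BatchWriter` interface standing in for real batch execution (e.g., `PreparedStatement.executeBatch()`); all names here are hypothetical:

```java
import java.util.List;

// Sketch of the batch-insert strategy: try a whole batch up to N times,
// then fall back to single-row inserts and log rows that still fail.
public class BatchLoader {
    @FunctionalInterface
    public interface BatchWriter {
        void write(List<String> rows) throws Exception; // stand-in for a JDBC batch
    }

    /** Loads rows in batches; returns the number of rows successfully written. */
    public static int load(List<String> rows, int batchSize, int maxRetries, BatchWriter writer) {
        int written = 0;
        for (int from = 0; from < rows.size(); from += batchSize) {
            List<String> batch = rows.subList(from, Math.min(from + batchSize, rows.size()));
            if (tryWrite(batch, maxRetries, writer)) {
                written += batch.size();
            } else {
                // Fall back to single-row inserts so one bad row cannot sink the batch.
                for (String row : batch) {
                    if (tryWrite(List.of(row), maxRetries, writer)) written++;
                    else System.err.println("FAILED ROW (log for manual replay): " + row);
                }
            }
        }
        return written;
    }

    private static boolean tryWrite(List<String> batch, int maxRetries, BatchWriter writer) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try { writer.write(batch); return true; } catch (Exception ignored) { }
        }
        return false;
    }
}
```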
Concurrency on a Single Table
Concurrent writes to the same table would break ordering. Instead of parallelizing writes, raise throughput by increasing the batch size.
Choosing the MySQL Storage Engine
MyISAM offers higher raw insert speed but lacks transactions, making it unsuitable for reliable batch loads. InnoDB with innodb_flush_log_at_trx_commit set to 0 or 2 can achieve comparable performance while preserving ACID guarantees.
Should We Shard the Database?
Single‑instance MySQL typically caps at ~5 K TPS. SSDs handle concurrent writes to multiple tables well, whereas HDDs suffer from head contention when several tables are written simultaneously. Make the number of databases and the number of concurrently written tables configurable so they can be tuned to the underlying storage.
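Since the 100 source files map one-to-one onto 100 tables spread across several databases, routing can be a pure function of the file suffix. A sketch under that assumption (the `logs_db_N.logs_NN` naming is hypothetical, not from the article):

```java
// Route each of the 100 source files (suffix 00-99) to one of `dbCount`
// databases; each file owns exactly one table, preserving per-file order.
public class ShardRouter {
    private final int dbCount;

    public ShardRouter(int dbCount) {
        if (dbCount < 1 || dbCount > 100) throw new IllegalArgumentException("dbCount out of range");
        this.dbCount = dbCount;
    }

    /** Database index for a file suffix (0-99), spread round-robin. */
    public int dbIndex(int fileSuffix) {
        return fileSuffix % dbCount;
    }

    /** Fully qualified target table, e.g. suffix 53 with 10 dbs -> "logs_db_3.logs_53". */
    public String target(int fileSuffix) {
        return "logs_db_" + dbIndex(fileSuffix) + ".logs_" + String.format("%02d", fileSuffix);
    }
}
```

Keeping `dbCount` (and the per-database write concurrency) configurable is what lets the same loader run on SSD- or HDD-backed instances.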
Fast File Reading Techniques
A 10 GB file cannot be loaded into memory in one go. Benchmarks on macOS give:
Method                      | Result
Files.readAllBytes          | OOM
FileReader + BufferedReader | 11 s
File + BufferedReader       | 10 s
Scanner                     | 57 s
Java NIO FileChannel        | 3 s
Although NIO is fastest, it reads fixed‑size buffers that do not align with line boundaries. BufferedReader provides line‑wise reading at acceptable speed (on the order of 10 s for the 10 GB test above) and is sufficient, because the real bottleneck is the database writes.
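Line-wise reading in fixed-size batches is exactly what the loading pipeline needs. A minimal sketch with `BufferedReader` (class name and the sink callback are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Stream a large file line by line, handing fixed-size batches to a
// consumer (e.g. the database writer). Memory use stays at one batch.
public class BatchedLineReader {
    /** Reads the file in batches of `batchSize` lines; returns the total line count. */
    public static long readInBatches(Path file, int batchSize, Consumer<List<String>> sink)
            throws IOException {
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    sink.accept(List.copyOf(batch));
                    total += batch.size();
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {                    // flush the final partial batch
            sink.accept(List.copyOf(batch));
            total += batch.size();
        }
        return total;
    }
}
```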
Coordinating Read and Write Tasks
Running 100 fully parallel read‑and‑write tasks would overload the database. Instead, each task alternates: read one batch, write it, then read the next, which naturally caps the number of concurrent write streams. Kafka was considered for decoupling reads from writes but was dropped because of its ordering and partitioning complexities.
Ensuring Task Reliability
Use a composite primary key (file suffix + line number) to guarantee idempotent inserts. Record task progress in Redis with INCRBY on a per‑task key; on failure, retry the batch or fall back to single inserts, updating Redis after each successful batch.
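The composite key and progress counter can be sketched as follows. This is an illustration only: a `ConcurrentHashMap` stands in for Redis, and `advance` plays the role of `INCRBY` on a per-file key; all names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent progress tracking. The composite primary key
// (file suffix + line number) makes replayed inserts harmless, and a
// per-task counter records how far each file has been loaded. In
// production the counter would live in Redis, updated with INCRBY
// after each successfully committed batch.
public class ProgressTracker {
    private final Map<String, Long> progress = new ConcurrentHashMap<>();

    /** Composite primary key, e.g. file 07, line 123 -> "07_123". */
    public static String rowKey(int fileSuffix, long lineNumber) {
        return String.format("%02d_%d", fileSuffix, lineNumber);
    }

    /** Records `batchSize` more committed rows for this file (INCRBY analogue). */
    public long advance(int fileSuffix, long batchSize) {
        return progress.merge("import:progress:" + fileSuffix, batchSize, Long::sum);
    }

    /** Line number to resume from after a crash (0 if the task never ran). */
    public long resumeFrom(int fileSuffix) {
        return progress.getOrDefault("import:progress:" + fileSuffix, 0L);
    }
}
```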
Controlling Concurrency with Redisson
Redisson semaphores can limit the number of simultaneous write tasks, but they lack lease renewal. A workaround is to elect a master node (via Zookeeper+Curator) that assigns tasks, acquires a distributed lock with renewal, and releases it upon completion.
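The throttling effect of a distributed semaphore can be illustrated locally with the JDK's `java.util.concurrent.Semaphore`; Redisson's `RSemaphore` offers the same acquire/release pattern across JVMs. A single-process sketch (the 50 ms sleep simulates a batch write):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Local illustration of capping concurrent write tasks with a semaphore.
public class WriteThrottle {
    /** Runs `tasks` jobs, at most `permits` at a time; returns the peak concurrency observed. */
    public static int run(int tasks, int permits) throws InterruptedException {
        Semaphore semaphore = new Semaphore(permits);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                semaphore.acquire();   // blocks while `permits` tasks are already writing
                try {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(50);  // simulated batch write
                } finally {
                    running.decrementAndGet();
                    semaphore.release();
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }
}
```

What the JDK version cannot provide is exactly the gap the article points out: permits held by a crashed process are never returned, which is why the master-node design with a renewable distributed lock is preferred.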
Summary
Confirm constraints before designing the solution.
Split data into multiple databases/tables based on scale.
Analyze single‑database write bottlenecks to decide on sharding.
Adjust concurrent write degrees according to SSD/HDD characteristics.
Compare InnoDB and MyISAM storage engines in production.
Determine optimal batch‑insert size through testing.
Combine read and write tasks to avoid Kafka‑related ordering issues.
Persist task progress in Redis to achieve exactly‑once semantics.
Use a master node with Zookeeper+Curator for reliable task distribution and Redisson locks for exclusive execution.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn