Efficient Strategies for Importing One Billion Records into MySQL
This article explains how to import 1 billion 1 KB log records stored in HDFS or S3 into MySQL by analyzing table capacity limits, choosing storage engines, designing batch inserts, coordinating file reading and writing, and handling task reliability with Redis, Redisson, and Zookeeper.
Before designing a solution, clarify the data characteristics: 1 billion rows, each about 1 KB, unstructured logs stored in HDFS or S3, split into 100 files, requiring ordered, non‑duplicate import into a MySQL database.
Can a single MySQL table hold 1 billion rows? In practice, no. Experience suggests keeping a single table under 20 million rows. The limit follows from B+‑tree index depth: a 3‑level index supports roughly 20 million rows, while a 4‑level index reaches about 25.6 billion at the cost of an extra page read per lookup.
Level | Maximum Rows
2     | 18,720
3     | 21,902,400 (≈ 20 M)
4     | 25,625,808,000 (≈ 25.6 B)
Therefore the design splits the data into 100 tables (≈10 M rows each) to stay within safe limits.
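The table‑routing rule implied above can be sketched as follows. The table‑name pattern `log_%03d`, the `ShardRouter` class name, and the 10 M rows‑per‑table constant are illustrative assumptions, not part of the original design:

```java
public class ShardRouter {
    static final int TABLE_COUNT = 100;           // 100 tables of ~10 M rows each
    static final long ROWS_PER_TABLE = 10_000_000L;

    // Route a global row number (0-based) to its target table index.
    static int tableIndex(long globalRow) {
        return (int) (globalRow / ROWS_PER_TABLE) % TABLE_COUNT;
    }

    // Physical table name, e.g. log_007 (naming pattern is assumed).
    static String tableName(long globalRow) {
        return String.format("log_%03d", tableIndex(globalRow));
    }
}
```

Dividing by a fixed rows‑per‑table constant (rather than hashing) preserves the ordering requirement: consecutive rows land in the same table until it fills up.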
Efficient writing techniques include using batch inserts (e.g., 100 rows per batch) and relying on InnoDB's transactional guarantees. Retries should be performed on failure; after a configurable number of retries, fall back to single‑row inserts and log the failures.
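The retry‑then‑fallback policy can be sketched independently of JDBC. The names (`writeWithFallback`, `maxRetries`) are illustrative; in the real pipeline the writer callback would execute a multi‑row INSERT inside a transaction:

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchInsert {
    // Try the whole batch up to maxRetries times; on persistent failure,
    // fall back to single-row writes and report rows that still fail.
    static <T> int writeWithFallback(List<T> batch,
                                     Consumer<List<T>> writer,
                                     int maxRetries,
                                     Consumer<T> failureLog) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                writer.accept(batch);
                return batch.size();            // whole batch succeeded
            } catch (RuntimeException e) {
                // batch failed; retry up to maxRetries times
            }
        }
        int ok = 0;
        for (T row : batch) {                   // degrade to row-by-row
            try {
                writer.accept(List.of(row));
                ok++;
            } catch (RuntimeException e) {
                failureLog.accept(row);         // record for manual repair
            }
        }
        return ok;
    }
}
```

The fallback isolates the poison row: the rest of the batch is still written, and only the genuinely bad rows end up in the failure log.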
Writing in primary‑key order yields the best performance. Non‑primary indexes should be avoided during bulk load or created after the load completes to prevent costly index reorganizations.
Concurrent writes to the same table are discouraged because they break ordering guarantees. Instead, increase the batch size to raise effective concurrency without parallel writes.
Choosing the MySQL storage engine: MyISAM offers higher raw insert speed but lacks transactional safety, while InnoDB provides ACID guarantees with comparable performance when innodb_flush_log_at_trx_commit is set to 0 or 2. The article recommends InnoDB unless the cluster policy forbids changing this setting.
Sharding (splitting databases) is necessary when write throughput exceeds a single instance’s capacity (≈5 K TPS). The design should allow configurable numbers of databases and concurrent write tables, adapting to SSD or HDD characteristics.
File‑reading performance was benchmarked on a 10 GB file using several Java approaches. The best result (≈3 s) came from Java NIO FileChannel with a fixed‑size buffer, while BufferedReader achieved acceptable performance (≈10–11 s) and naturally supports line‑by‑line processing, making it the preferred choice for the import pipeline.
File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {
    int capacity = 1 * 1024 * 1024; // 1 MB read buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    int read;
    while ((read = fileChannel.read(byteBuffer)) != -1) {
        size += read;       // count only the bytes actually read, not the buffer capacity
        byteBuffer.clear(); // reset the buffer for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));
Because the import bottleneck lies in database writes, the reading stage can finish in ~30 s without affecting overall throughput.
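For the pipeline itself, the BufferedReader approach reads lines and groups them into insert batches. A minimal sketch, assuming the batch size of 100 from the batching discussion above (the `LineBatcher` name and the sink callback are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LineBatcher {
    // Read the file line by line and hand off fixed-size batches to the sink
    // (in the real pipeline, the sink performs the batch INSERT).
    static long readInBatches(Path file, int batchSize,
                              Consumer<List<String>> sink) throws IOException {
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                sink.accept(batch);  // flush the partial tail batch
            }
        }
        return total;
    }
}
```

Merging reading and writing into one loop like this avoids buffering the whole file in memory and keeps rows in file order, which matches the ordered-import requirement.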
Task reliability is ensured by constructing a unique primary key from {taskId}{fileIndex}{fileRowNumber} . Redis can store the current offset for each task using INCRBY commands; on failure, the task retries and updates Redis atomically. If Redis and the database diverge, binlog consumption can be used to reconcile offsets.
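A minimal sketch of that key scheme, assuming fixed‑width segments (the widths, the `TaskProgress` name, and the `import:offset:` prefix are illustrative; a real worker would advance the offset in Redis with INCRBY after each committed batch):

```java
public class TaskProgress {
    // Unique primary key combining task, file, and row, following the
    // {taskId}{fileIndex}{fileRowNumber} scheme; fixed widths keep the
    // key sortable and collision-free (widths are an assumption).
    static String rowKey(int taskId, int fileIndex, long fileRowNumber) {
        return String.format("%03d%03d%09d", taskId, fileIndex, fileRowNumber);
    }

    // Redis key under which the task's committed offset would be kept;
    // a worker resumes from this offset after a failure.
    static String offsetKey(int taskId, int fileIndex) {
        return "import:offset:" + taskId + ":" + fileIndex;
    }
}
```

Because the row key is derived deterministically from the task, file, and row number, a retried batch produces the same primary keys, so duplicate inserts fail on the key constraint rather than silently double‑writing.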
Task coordination uses a distributed semaphore (Redisson) to limit concurrent writes per database. Because Redisson's semaphore does not support automatic lease renewal, the design switches to a leader‑election model built on Zookeeper + Curator: the leader assigns tasks, each worker holds a distributed lock for the duration of its task, and the lock is renewed automatically.
Overall, the solution emphasizes: confirming constraints, sharding tables, limiting write concurrency, selecting the appropriate storage engine, tuning batch sizes, merging read‑write tasks, and using Redis/Zookeeper for progress tracking and coordination.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn