Strategies for Efficiently Importing One Billion Records into MySQL
This article analyzes the constraints of loading one billion 1 KB log records from distributed storage into MySQL, evaluates single‑table limits, and proposes batch‑insert, sharding, storage‑engine, file‑reading, and distributed‑task coordination techniques to achieve high‑speed, ordered, and reliable data ingestion.
Before designing a solution, clarify the interview constraints: 1 billion records, each about 1 KB, stored as unstructured logs in HDFS or S3 and split into roughly 100 files; the data must be imported into MySQL in order and without duplicates.
Can a single MySQL table hold 1 billion rows?
The answer is no; the practical single‑table limit is around 20 million rows. InnoDB stores rows in a B+‑tree clustered index whose pages are 16 KB. With ~1 KB rows, each leaf page holds about 16 rows, and each non‑leaf page holds about 1 170 child pointers (16 KB divided by roughly 14 bytes per index entry: an 8‑byte bigint key plus a 6‑byte page pointer). The tree depth then determines capacity:
Depth | Maximum Rows
2 | 1 170 × 16 = 18 720
3 | 1 170 × 1 170 × 16 = 21 902 400 (≈20 M)
4 | 1 170³ × 16 = 25 625 808 000 (≈25.6 B)
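The figures in the table can be reproduced with a short calculation (a sketch; the 1 170 pointers per non‑leaf page and 16 rows per leaf page come from the page‑size arithmetic above):

```java
// Computes the maximum row count of an InnoDB clustered index of a given
// depth, assuming each 16 KB non-leaf page holds ~1,170 child pointers and
// each leaf page holds 16 rows of ~1 KB.
public class BTreeCapacity {
    static final long POINTERS_PER_PAGE = 1_170;
    static final long ROWS_PER_LEAF = 16;

    // depth = number of levels, including the leaf level
    static long maxRows(int depth) {
        long leaves = 1;
        for (int level = 1; level < depth; level++) {
            leaves *= POINTERS_PER_PAGE; // each non-leaf level fans out
        }
        return leaves * ROWS_PER_LEAF;
    }

    public static void main(String[] args) {
        System.out.println("depth 2: " + maxRows(2)); // 18,720
        System.out.println("depth 3: " + maxRows(3)); // 21,902,400
        System.out.println("depth 4: " + maxRows(4)); // ~25.6 billion
    }
}
```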
Therefore a practical design splits the data into multiple tables (e.g., 100 tables of 10 M rows each).
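One way to route each record to its table is a stable file‑index mapping (a minimal sketch; the `t_log_` naming and the 100‑table layout are illustrative assumptions):

```java
// Routes records to one of N tables. Since the source data arrives as ~100
// files, the simplest stable mapping is file index -> table index, which
// keeps each file's records together and in order within a single table.
public class TableRouter {
    static final int TABLE_COUNT = 100; // assumption: 100 tables of ~10 M rows

    // fileIndex is 0-based; names like t_log_00 .. t_log_99 are illustrative
    static String tableFor(int fileIndex) {
        return String.format("t_log_%02d", fileIndex % TABLE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(0));  // t_log_00
        System.out.println(tableFor(42)); // t_log_42
    }
}
```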
How to write efficiently
Single‑row inserts are slow; use batch inserts (e.g., 100 rows per batch). Rely on InnoDB’s transactional guarantees for atomic batch writes, implement retry logic, and avoid creating non‑primary indexes before bulk loading.
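The batch pattern can be sketched by building one multi‑row INSERT per batch (table and column names are illustrative; with JDBC the same effect comes from `PreparedStatement.addBatch()` plus `rewriteBatchedStatements=true` in the MySQL driver URL):

```java
import java.util.StringJoiner;

// Builds one multi-row INSERT statement so that an entire batch reaches
// MySQL in a single round trip. The statement is prepared once and bound
// with 3 * batchSize parameters per execution.
public class BatchInsert {
    static String buildInsert(String table, int batchSize) {
        StringJoiner values = new StringJoiner(",");
        for (int i = 0; i < batchSize; i++) {
            values.add("(?,?,?)"); // file_idx, line_no, payload
        }
        return "INSERT INTO " + table
                + " (file_idx, line_no, payload) VALUES " + values;
    }

    public static void main(String[] args) {
        System.out.println(buildInsert("t_log_00", 3));
    }
}
```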
Concurrent writes to the same table
Concurrent writes to a single table break the ordering guarantee, so the design never writes to the same table from multiple tasks in parallel; throughput is raised by increasing the batch size instead, while parallelism comes from writing different tables concurrently.
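Per‑table ordering can be preserved while many tables are written in parallel by giving each table its own single‑threaded writer (a sketch with `java.util.concurrent`; the task bodies are illustrative stand‑ins for real batch INSERTs):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// One single-threaded executor per table: batches for the same table run
// strictly in submission order, while different tables proceed in parallel.
public class PerTableWriters {
    static final int TABLE_COUNT = 100;
    static final Map<Integer, ExecutorService> WRITERS = new ConcurrentHashMap<>();

    static void submitBatch(int tableIndex, Runnable batchWrite) {
        WRITERS.computeIfAbsent(tableIndex % TABLE_COUNT,
                i -> Executors.newSingleThreadExecutor())
               .submit(batchWrite);
    }

    public static void main(String[] args) throws Exception {
        submitBatch(7, () -> System.out.println("batch 1 -> table 7"));
        submitBatch(7, () -> System.out.println("batch 2 -> table 7")); // after batch 1
        for (ExecutorService e : WRITERS.values()) {
            e.shutdown();
            e.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```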
Choosing the MySQL storage engine
MyISAM offers higher raw insert speed but lacks transactions, risking duplicate data on retries. InnoDB, with innodb_flush_log_at_trx_commit set to 0 or 2 (trading up to about one second of crash durability for far fewer log flushes), provides acceptable bulk‑insert performance while ensuring atomicity.
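For the bulk‑load phase, the relevant setting might look like this in `my.cnf` (a sketch of one reasonable configuration, not the article's literal config):

```ini
[mysqld]
# 2 = flush the redo log to the OS page cache at every commit, and to disk
# about once per second; a crash can lose up to ~1 s of commits, which is
# acceptable during a resumable bulk load and avoids an fsync per batch.
innodb_flush_log_at_trx_commit = 2
```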
Should we shard the database?
Single‑instance MySQL typically caps at ~5 K TPS. SSDs improve throughput, but HDDs suffer from head‑seeking when multiple tables write concurrently. The design therefore allows configurable numbers of databases and tables, adaptable to both SSD and HDD environments.
Fast file reading
Reading a 10 GB file cannot be done in memory. Benchmarks on macOS show:
Method | Result
Files.readAllBytes | OOM
FileReader + BufferedReader (line‑by‑line) | 11 s
File + BufferedReader | 10 s
Scanner | 57 s
Java NIO FileChannel (buffered) | 3 s
Although NIO is fastest, it reads fixed‑size buffers with no awareness of line boundaries, which complicates parsing records. BufferedReader delivers acceptable throughput with natural line‑by‑line handling, so it is chosen. The NIO benchmark harness from the test is shown below.
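A minimal sketch of the chosen BufferedReader approach (the path is illustrative; in the real importer the loop body would parse and batch each line):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Streams a large file line by line: BufferedReader keeps only one buffered
// chunk in memory at a time, and readLine() handles line boundaries for us.
public class LineReader {
    static long countLines(String path) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            while (reader.readLine() != null) {
                lines++; // a real importer would parse and batch the line here
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("lines: " + countLines("/xxx.log")); // illustrative path
    }
}
```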
File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {
    int capacity = 1 * 1024 * 1024; // 1 MB buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    int read;
    // read() returns the number of bytes transferred, or -1 at end of file
    while ((read = fileChannel.read(byteBuffer)) != -1) {
        size += read;       // count only the bytes actually read, not the capacity
        byteBuffer.clear(); // reset position/limit for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));
Coordinating read and write tasks
Attempting to separate readers and writers via Kafka introduces ordering challenges; partitions would mix records from different tables, breaking the single‑table batch constraint. The final design merges reading and writing: each task reads a chunk, parses it, and writes the batch directly.
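The merged task can be sketched as a loop that accumulates lines and flushes full batches (the `Consumer` stands in for the batch INSERT; the batch size of 100 comes from the batching section):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A single task owns one file chunk end to end: it reads lines, accumulates
// them into fixed-size batches, and hands each full batch to the writer.
// With no queue between reader and writer, records cannot be reordered.
public class ImportTask {
    static final int BATCH_SIZE = 100;

    // returns the number of batches flushed
    static int run(Iterable<String> lines, Consumer<List<String>> writeBatch) {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        int flushed = 0;
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == BATCH_SIZE) {
                writeBatch.accept(batch); // one multi-row INSERT per batch
                batch = new ArrayList<>(BATCH_SIZE);
                flushed++;
            }
        }
        if (!batch.isEmpty()) { // final partial batch
            writeBatch.accept(batch);
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 250; i++) lines.add("record " + i);
        int batches = run(lines, b -> System.out.println("write " + b.size() + " rows"));
        System.out.println("batches: " + batches); // 100 + 100 + 50 -> 3
    }
}
```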
Ensuring reliability
Use a composite primary key (file index + line number) to guarantee idempotent inserts. Track task progress in Redis with INCRBY commands; on failure, retry the batch, then fall back to single‑row inserts while updating Redis. If a task restarts, read the stored offset and resume.
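Progress tracking can be sketched with a map standing in for Redis (in production the increment would be a Redis `INCRBY` on a per‑file key; the key format here is an illustrative assumption):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks how many lines of each file have been durably written. Combined
// with a composite primary key (file_idx, line_no), a restarted task can
// resume at the stored offset, and re-inserts of already-written lines
// are rejected by the key constraint.
public class ProgressTracker {
    // stand-in for Redis; key "import:progress:<fileIndex>" -> lines written
    final Map<String, Long> store = new ConcurrentHashMap<>();

    void recordBatch(int fileIndex, long lines) {
        store.merge("import:progress:" + fileIndex, lines, Long::sum); // INCRBY
    }

    long resumeOffset(int fileIndex) { // first line still to import
        return store.getOrDefault("import:progress:" + fileIndex, 0L);
    }

    public static void main(String[] args) {
        ProgressTracker tracker = new ProgressTracker();
        tracker.recordBatch(3, 100);
        tracker.recordBatch(3, 100);
        System.out.println(tracker.resumeOffset(3)); // 200
    }
}
```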
Controlling concurrency
Tasks share a semaphore (Redisson) to limit the number of concurrent writes per database. However, Redisson lacks lease renewal, so the design prefers a leader node (selected via Zookeeper+Curator) that assigns tasks, publishes messages, and uses distributed locks with renewal to protect long‑running jobs.
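The per‑database write limit can be sketched with a local `Semaphore` (Redisson's distributed `RSemaphore` exposes the same acquire/release shape across JVMs; the limit of 4 is an illustrative assumption):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Caps how many batch writers may hit one database at a time. Each task
// acquires a permit before writing and releases it afterwards; the
// high-water mark lets us observe the effective concurrency.
public class WriteThrottle {
    final Semaphore permits;
    final AtomicInteger active = new AtomicInteger();
    final AtomicInteger maxActive = new AtomicInteger();

    WriteThrottle(int maxConcurrentWrites) {
        permits = new Semaphore(maxConcurrentWrites);
    }

    void write(Runnable batchWrite) throws InterruptedException {
        permits.acquire();
        try {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max); // record high-water mark
            batchWrite.run();
        } finally {
            active.decrementAndGet();
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        WriteThrottle throttle = new WriteThrottle(4); // illustrative limit
        Thread[] threads = new Thread[16];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                try {
                    throttle.write(() -> { /* batch INSERT would run here */ });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("max concurrent writers: " + throttle.maxActive.get());
    }
}
```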
Summary
Key take‑aways: clarify constraints, split data into multiple tables/databases, batch inserts with appropriate storage engine, use BufferedReader for file reading, track progress in Redis, and coordinate tasks via a leader‑based scheduler to achieve high‑speed, ordered, and reliable ingestion of one billion records into MySQL.