Strategies for Efficiently Importing One Billion Records into MySQL
This article analyzes the constraints of loading one billion 1 KB log records from distributed storage into MySQL, evaluates single‑table limits, and proposes batch‑insert, sharding, storage‑engine, file‑reading, and distributed‑task coordination techniques to achieve high‑speed, ordered, and reliable data ingestion.
Before designing a solution, clarify the interview constraints: 1 billion records, each about 1 KB, stored as unstructured logs in HDFS or S3 and split into roughly 100 files; the data must be imported into MySQL in order and without duplicates.
Can a single MySQL table hold 1 billion rows?
The answer is no; the practical single‑table limit is around 20 million rows. InnoDB stores rows in a B+‑tree clustered index whose pages are 16 KB. With ~1 KB rows, each leaf page holds about 16 rows, and each non‑leaf page holds about 1 170 child pointers (16 KB divided by roughly 14 bytes per index entry: an 8‑byte bigint key plus a 6‑byte page pointer). The tree depth then determines capacity:
Depth | Maximum Rows
2 | 1 170 × 16 = 18 720
3 | 1 170 × 1 170 × 16 = 21 902 400 (≈20 M)
4 | 1 170³ × 16 = 25 625 808 000 (≈25.6 B)
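The figures in the table can be reproduced with a short calculation (a sketch; the 1 170 pointers per non‑leaf page and 16 rows per leaf page come from the page‑size arithmetic above):

```java
// Computes the maximum row count of an InnoDB clustered index of a given
// depth, assuming each 16 KB non-leaf page holds ~1,170 child pointers and
// each leaf page holds 16 rows of ~1 KB.
public class BTreeCapacity {
    static final long POINTERS_PER_PAGE = 1_170;
    static final long ROWS_PER_LEAF = 16;

    // depth = number of levels, including the leaf level
    static long maxRows(int depth) {
        long leaves = 1;
        for (int level = 1; level < depth; level++) {
            leaves *= POINTERS_PER_PAGE; // each non-leaf level fans out
        }
        return leaves * ROWS_PER_LEAF;
    }

    public static void main(String[] args) {
        System.out.println("depth 2: " + maxRows(2)); // 18,720
        System.out.println("depth 3: " + maxRows(3)); // 21,902,400
        System.out.println("depth 4: " + maxRows(4)); // ~25.6 billion
    }
}
```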
Therefore a practical design splits the data into multiple tables (e.g., 100 tables of 10 M rows each).
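One way to route each record to its table is a stable file‑index mapping (a minimal sketch; the `t_log_` naming and the 100‑table layout are illustrative assumptions):

```java
// Routes records to one of N tables. Since the source data arrives as ~100
// files, the simplest stable mapping is file index -> table index, which
// keeps each file's records together and in order within a single table.
public class TableRouter {
    static final int TABLE_COUNT = 100; // assumption: 100 tables of ~10 M rows

    // fileIndex is 0-based; names like t_log_00 .. t_log_99 are illustrative
    static String tableFor(int fileIndex) {
        return String.format("t_log_%02d", fileIndex % TABLE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(0));  // t_log_00
        System.out.println(tableFor(42)); // t_log_42
    }
}
```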
How to write efficiently
Single‑row inserts are slow; use batch inserts (e.g., 100 rows per batch). Rely on InnoDB’s transactional guarantees for atomic batch writes, implement retry logic, and avoid creating non‑primary indexes before bulk loading.
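The batch pattern can be sketched by building one multi‑row INSERT per batch (table and column names are illustrative; with JDBC the same effect comes from `PreparedStatement.addBatch()` plus `rewriteBatchedStatements=true` in the MySQL driver URL):

```java
import java.util.StringJoiner;

// Builds one multi-row INSERT statement so that an entire batch reaches
// MySQL in a single round trip. The statement is prepared once and bound
// with 3 * batchSize parameters per execution.
public class BatchInsert {
    static String buildInsert(String table, int batchSize) {
        StringJoiner values = new StringJoiner(",");
        for (int i = 0; i < batchSize; i++) {
            values.add("(?,?,?)"); // file_idx, line_no, payload
        }
        return "INSERT INTO " + table
                + " (file_idx, line_no, payload) VALUES " + values;
    }

    public static void main(String[] args) {
        System.out.println(buildInsert("t_log_00", 3));
    }
}
```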
Concurrent writes to the same table
Concurrent writes to a single table break the ordering guarantee, so the design never writes to the same table from multiple tasks in parallel; throughput is raised by increasing the batch size instead, while parallelism comes from writing different tables concurrently.
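Per‑table ordering can be preserved while many tables are written in parallel by giving each table its own single‑threaded writer (a sketch with `java.util.concurrent`; the task bodies are illustrative stand‑ins for real batch INSERTs):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// One single-threaded executor per table: batches for the same table run
// strictly in submission order, while different tables proceed in parallel.
public class PerTableWriters {
    static final int TABLE_COUNT = 100;
    static final Map<Integer, ExecutorService> WRITERS = new ConcurrentHashMap<>();

    static void submitBatch(int tableIndex, Runnable batchWrite) {
        WRITERS.computeIfAbsent(tableIndex % TABLE_COUNT,
                i -> Executors.newSingleThreadExecutor())
               .submit(batchWrite);
    }

    public static void main(String[] args) throws Exception {
        submitBatch(7, () -> System.out.println("batch 1 -> table 7"));
        submitBatch(7, () -> System.out.println("batch 2 -> table 7")); // after batch 1
        for (ExecutorService e : WRITERS.values()) {
            e.shutdown();
            e.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```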
Choosing the MySQL storage engine
MyISAM offers higher raw insert speed but lacks transactions, risking duplicate data on retries. InnoDB, with innodb_flush_log_at_trx_commit set to 0 or 2 (trading up to about one second of crash durability for far fewer log flushes), provides acceptable bulk‑insert performance while ensuring atomicity.
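For the bulk‑load phase, the relevant setting might look like this in `my.cnf` (a sketch of one reasonable configuration, not the article's literal config):

```ini
[mysqld]
# 2 = flush the redo log to the OS page cache at every commit, and to disk
# about once per second; a crash can lose up to ~1 s of commits, which is
# acceptable during a resumable bulk load and avoids an fsync per batch.
innodb_flush_log_at_trx_commit = 2
```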
Should we shard the database?
Single‑instance MySQL typically caps at ~5 K TPS. SSDs improve throughput, but HDDs suffer from head‑seeking when multiple tables write concurrently. The design therefore allows configurable numbers of databases and tables, adaptable to both SSD and HDD environments.
Fast file reading
Reading a 10 GB file cannot be done in memory. Benchmarks on macOS show:
Method | Result
Files.readAllBytes | OOM
FileReader + BufferedReader (line‑by‑line) | 11 s
File + BufferedReader | 10 s
Scanner | 57 s
Java NIO FileChannel (buffered) | 3 s
Although NIO is fastest, it reads fixed‑size buffers with no awareness of line boundaries, which complicates parsing records. BufferedReader delivers acceptable throughput with natural line‑by‑line handling, so it is chosen. The NIO benchmark harness from the test is shown below.
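A minimal sketch of the chosen BufferedReader approach (the path is illustrative; in the real importer the loop body would parse and batch each line):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Streams a large file line by line: BufferedReader keeps only one buffered
// chunk in memory at a time, and readLine() handles line boundaries for us.
public class LineReader {
    static long countLines(String path) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            while (reader.readLine() != null) {
                lines++; // a real importer would parse and batch the line here
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("lines: " + countLines("/xxx.log")); // illustrative path
    }
}
```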
File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {
    int capacity = 1 * 1024 * 1024; // 1 MB buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    int read;
    // read() returns the number of bytes transferred, or -1 at end of file
    while ((read = fileChannel.read(byteBuffer)) != -1) {
        size += read;       // count only the bytes actually read, not the capacity
        byteBuffer.clear(); // reset position/limit for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));
Coordinating read and write tasks
Attempting to separate readers and writers via Kafka introduces ordering challenges; partitions would mix records from different tables, breaking the single‑table batch constraint. The final design merges reading and writing: each task reads a chunk, parses it, and writes the batch directly.
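The merged task can be sketched as a loop that accumulates lines and flushes full batches (the `Consumer` stands in for the batch INSERT; the batch size of 100 comes from the batching section):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A single task owns one file chunk end to end: it reads lines, accumulates
// them into fixed-size batches, and hands each full batch to the writer.
// With no queue between reader and writer, records cannot be reordered.
public class ImportTask {
    static final int BATCH_SIZE = 100;

    // returns the number of batches flushed
    static int run(Iterable<String> lines, Consumer<List<String>> writeBatch) {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        int flushed = 0;
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == BATCH_SIZE) {
                writeBatch.accept(batch); // one multi-row INSERT per batch
                batch = new ArrayList<>(BATCH_SIZE);
                flushed++;
            }
        }
        if (!batch.isEmpty()) { // final partial batch
            writeBatch.accept(batch);
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 250; i++) lines.add("record " + i);
        int batches = run(lines, b -> System.out.println("write " + b.size() + " rows"));
        System.out.println("batches: " + batches); // 100 + 100 + 50 -> 3
    }
}
```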
Ensuring reliability
Use a composite primary key (file index + line number) to guarantee idempotent inserts. Track task progress in Redis with INCRBY commands; on failure, retry the batch, then fall back to single‑row inserts while updating Redis. If a task restarts, read the stored offset and resume.
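Progress tracking can be sketched with a map standing in for Redis (in production the increment would be a Redis `INCRBY` on a per‑file key; the key format here is an illustrative assumption):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks how many lines of each file have been durably written. Combined
// with a composite primary key (file_idx, line_no), a restarted task can
// resume at the stored offset, and re-inserts of already-written lines
// are rejected by the key constraint.
public class ProgressTracker {
    // stand-in for Redis; key "import:progress:<fileIndex>" -> lines written
    final Map<String, Long> store = new ConcurrentHashMap<>();

    void recordBatch(int fileIndex, long lines) {
        store.merge("import:progress:" + fileIndex, lines, Long::sum); // INCRBY
    }

    long resumeOffset(int fileIndex) { // first line still to import
        return store.getOrDefault("import:progress:" + fileIndex, 0L);
    }

    public static void main(String[] args) {
        ProgressTracker tracker = new ProgressTracker();
        tracker.recordBatch(3, 100);
        tracker.recordBatch(3, 100);
        System.out.println(tracker.resumeOffset(3)); // 200
    }
}
```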
Controlling concurrency
Tasks share a semaphore (Redisson) to limit the number of concurrent writes per database. However, Redisson lacks lease renewal, so the design prefers a leader node (selected via Zookeeper+Curator) that assigns tasks, publishes messages, and uses distributed locks with renewal to protect long‑running jobs.
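The per‑database write limit can be sketched with a local `Semaphore` (Redisson's distributed `RSemaphore` exposes the same acquire/release shape across JVMs; the limit of 4 is an illustrative assumption):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Caps how many batch writers may hit one database at a time. Each task
// acquires a permit before writing and releases it afterwards; the
// high-water mark lets us observe the effective concurrency.
public class WriteThrottle {
    final Semaphore permits;
    final AtomicInteger active = new AtomicInteger();
    final AtomicInteger maxActive = new AtomicInteger();

    WriteThrottle(int maxConcurrentWrites) {
        permits = new Semaphore(maxConcurrentWrites);
    }

    void write(Runnable batchWrite) throws InterruptedException {
        permits.acquire();
        try {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max); // record high-water mark
            batchWrite.run();
        } finally {
            active.decrementAndGet();
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        WriteThrottle throttle = new WriteThrottle(4); // illustrative limit
        Thread[] threads = new Thread[16];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                try {
                    throttle.write(() -> { /* batch INSERT would run here */ });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("max concurrent writers: " + throttle.maxActive.get());
    }
}
```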
Summary
Key take‑aways: clarify constraints, split data into multiple tables/databases, batch inserts with appropriate storage engine, use BufferedReader for file reading, track progress in Redis, and coordinate tasks via a leader‑based scheduler to achieve high‑speed, ordered, and reliable ingestion of one billion records into MySQL.