
Fast Import of 1 Billion Records into MySQL: Design, Performance, and Reliability Considerations

To import one billion 1 KB log records into MySQL efficiently, the article examines data size constraints, B‑tree index limits, batch insertion strategies, storage engine choices, file‑reading techniques, task coordination with Redis, Redisson semaphores, and distributed lock handling to ensure ordered, reliable, high‑throughput loading.

Architecture Digest

To import one billion 1 KB log records into MySQL quickly, the article first clarifies the constraints: each record is 1 KB, data is unstructured logs stored in HDFS or S3, split into about 100 files, ordered import is required, and the target database is MySQL.

The feasibility of writing all records into a single MySQL table is examined first. MySQL's B+‑tree depth suggests a practical maximum of about 20 million rows per table for a three‑level index. The calculation assumes a 16 KB leaf page holding 16 one‑kilobyte rows, and a non‑leaf page holding roughly 1 170 entries, since 16 KB ÷ (8‑byte primary key + 6‑byte child pointer) ≈ 1 170.

| Level | Maximum rows |
| --- | --- |
| 2 | 1 170 × 16 = 18 720 |
| 3 | 1 170 × 1 170 × 16 ≈ 21 902 400 ≈ 20 M |
| 4 | 1 170 × 1 170 × 1 170 × 16 ≈ 25 625 808 000 ≈ 25.6 B |

Therefore the design recommends splitting the data into about 100 tables (each holding ~10 M rows) rather than a single massive table.
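The per-level arithmetic above can be checked with a few lines of Java, using the page size, row size, and key/pointer widths assumed by the article:

```java
public class BTreeCapacity {
    // Assumptions from the article: 16 KB pages, 1 KB rows,
    // 8-byte primary key + 6-byte child pointer per non-leaf entry.
    static final int ROWS_PER_LEAF = 16 * 1024 / 1024;        // 16 rows per leaf page
    static final int POINTERS_PER_NODE = 16 * 1024 / (8 + 6); // ≈ 1 170 entries per non-leaf page

    // Maximum rows addressable by a B+-tree of the given height.
    static long maxRows(int levels) {
        long rows = ROWS_PER_LEAF;
        for (int i = 0; i < levels - 1; i++) {
            rows *= POINTERS_PER_NODE; // each extra level multiplies the fan-out
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("2-level tree: " + maxRows(2) + " rows"); // 18 720
        System.out.println("3-level tree: " + maxRows(3) + " rows"); // 21 902 400
    }
}
```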

For efficient writing, batch inserts are preferred; a batch size of 100 rows (each 1 KB) is suggested. MySQL InnoDB guarantees atomicity of batch transactions. The process should include retry logic, and after successful batch insertion, the task should record progress (e.g., in Redis) to enable idempotent recovery.
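A minimal sketch of the batching step, assuming a hypothetical `log_0` table with a single `line` column (both names are illustrative, not from the article); in production each batch would go through JDBC `addBatch`/`executeBatch` inside one transaction, with a retry on failure:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchBuilder {
    static final int BATCH_SIZE = 100; // 100 rows × 1 KB per batch, as suggested

    // Split the record stream into batches of at most BATCH_SIZE rows.
    static List<List<String>> toBatches(List<String> records) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += BATCH_SIZE) {
            batches.add(records.subList(i, Math.min(i + BATCH_SIZE, records.size())));
        }
        return batches;
    }

    // Build one multi-row INSERT for a batch; table/column names are assumptions.
    static String toInsertSql(List<String> batch) {
        StringBuilder sql = new StringBuilder("INSERT INTO log_0 (line) VALUES ");
        for (int i = 0; i < batch.size(); i++) {
            if (i > 0) sql.append(", ");
            sql.append("('").append(batch.get(i).replace("'", "''")).append("')");
        }
        return sql.toString();
    }
}
```

Because InnoDB commits each multi-row INSERT atomically, a failed batch can simply be retried, and the Redis offset is advanced only after the commit succeeds.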

Concurrent writes to the same table are discouraged because they break ordering and cause index‑tree restructuring overhead. Instead, increase the batch size to raise effective concurrency while keeping writes sequential per primary‑key order. Non‑primary indexes should be avoided during the load and created afterwards.

The choice of storage engine is discussed. MyISAM offers higher raw insert speed but lacks transactional safety, while InnoDB provides ACID guarantees with comparable performance when innodb_flush_log_at_trx_commit is set to 0 or 2. A performance test image (omitted) shows InnoDB’s batch insert speed close to MyISAM when the flush‑log policy is relaxed.
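Assuming the relaxed flush policy is applied through server configuration, the my.cnf fragment might look like this (with value 2, the redo log is written at each commit but fsynced only about once per second, trading at most the last second of durability for throughput during the load):

```ini
[mysqld]
innodb_flush_log_at_trx_commit = 2
```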

Sharding considerations are presented: a single MySQL instance typically caps at ~5 K TPS. SSDs handle concurrent writes better than HDDs; therefore the system should allow dynamic configuration of the number of databases and the number of tables written concurrently, adapting to the underlying storage.

File‑reading methods are benchmarked on macOS. Files.readAllBytes causes OOM, FileReader+BufferedReader takes ~11 s, File+BufferedReader ~10 s, Scanner ~57 s, and Java NIO FileChannel with a 1 MB buffer ~3 s. Because line‑by‑line processing is needed, BufferedReader is chosen as the practical solution.

File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {

    int capacity = 1 * 1024 * 1024; // 1 MB read buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    long size = 0;
    while (fileChannel.read(byteBuffer) != -1) {
        byteBuffer.flip();              // switch the buffer to read mode
        size += byteBuffer.remaining(); // bytes actually read in this pass
        byteBuffer.clear();             // reset the buffer for the next read
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));
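Since the loader must process the input line by line, the benchmark's practical winner looks like the following sketch (the file path is illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineReader {
    public static void main(String[] args) throws IOException {
        long lines = 0;
        // BufferedReader reads the file in large chunks internally but
        // exposes it one line at a time, which is what the batch writer needs.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("/xxx.log"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines++; // hand each line to the batch writer here
            }
        }
        System.out.println("lines:" + lines);
    }
}
```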

Task coordination combines reading and writing in a single worker. Each worker reads a chunk of a file, writes it to the assigned table, and updates its offset in Redis. The primary key can be constructed from {taskId}{fileIndex}{fileRowNumber} to guarantee idempotency without relying on auto‑increment.
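One way to compose such a key is to pack the three parts into fixed decimal widths; the widths below (3 digits for the file index, 9 for the row number, enough for ~10 M rows per file) are illustrative assumptions, not values from the article:

```java
public class RecordKey {
    // Deterministic primary key from {taskId}{fileIndex}{fileRowNumber}:
    // a retried batch rewrites the same keys instead of duplicating rows,
    // and keys stay monotonically increasing within a task.
    static long buildKey(int taskId, int fileIndex, long fileRowNumber) {
        return taskId * 1_000_000_000_000L   // task id in the top digits
             + fileIndex * 1_000_000_000L    // 3 digits for the file index (0-999)
             + fileRowNumber;                // 9 digits for the row number
    }
}
```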

The task table schema includes fields such as bizId, databaseIndex, tableIndex, parentTaskId, offset, and status. Workers acquire tasks via a polling mechanism and use Redisson semaphores to limit concurrent writes per database. However, Redisson semaphores lack lease renewal, so the article proposes a master‑node approach using Zookeeper+Curator to assign tasks and a distributed lock with renewal to avoid stale semaphore issues.
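Redisson's RSemaphore exposes the same acquire/release shape as java.util.concurrent.Semaphore; an in-process sketch of the per-database write limit (the permit count of 4 is an assumed value, in practice it would be configurable per the sharding discussion above):

```java
import java.util.concurrent.Semaphore;

public class DbWritePermits {
    // In-process analogue of Redisson's distributed RSemaphore:
    // cap the number of workers writing to one database at a time.
    private final Semaphore permits = new Semaphore(4);

    void writeBatch(Runnable batchWrite) throws InterruptedException {
        permits.acquire();        // block until a write slot is free
        try {
            batchWrite.run();     // perform the batch insert
        } finally {
            permits.release();    // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The distributed version has the failure mode the article warns about: a crashed worker never releases its permit, which is why lease renewal (or a master node reassigning tasks) is needed.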

In summary, the key points are: clarify constraints, split data into multiple tables/databases, use batch inserts with proper transaction settings, prefer InnoDB with relaxed flush policy, choose BufferedReader for file reading, record progress in Redis, and coordinate workers through a master node with Zookeeper‑based leader election and distributed locks.

Tags: distributed systems, performance, big data, Redis, MySQL, batch insert, data-import
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
