Databases 20 min read

How to Import 1 Billion Records into MySQL Efficiently: Architecture & Performance Guide

This article walks through the full engineering process of importing a billion 1 KB log records into MySQL, covering data size calculations, B‑tree index limits, batch‑insert strategies, storage‑engine trade‑offs, fast file‑reading techniques, task coordination with Redis and Redisson, and distributed‑lock design using Zookeeper.

Architect

Nov 28, 2023

Problem Setup and Constraints

Load 1 billion rows (≈1 KB each) of unstructured access‑log data stored in HDFS or S3, split into roughly 100 files, into MySQL while preserving order and minimizing duplicates.

Why a Single MySQL Table Won’t Scale

MySQL primary‑key index is a B+‑tree. A leaf page is 16 KB, so with 1 KB rows each leaf holds 16 rows. A non‑leaf page (also 16 KB) stores only the primary‑key (8 bytes) and a pointer (6 bytes) → 14 bytes per entry, allowing 16*1024/14≈1170 child pointers per non‑leaf node.

Maximum rows per index depth:

Depth 2: 1170 × 16 = 18 720 rows

Depth 3: 1170 × 1170 × 16 ≈ 2 × 10⁷ (≈20 M) rows

Depth 4: 1170³ × 16 ≈ 2.56 × 10¹⁰ (≈256 B) rows

Since a single table comfortably stays under ~20 M rows, the data is split into 100 tables, each holding ~10 M rows.

Efficient Write Strategy

Use batch inserts (default 100 rows per batch) to amortize per‑row overhead.

Rely on InnoDB’s transactional guarantees so a batch either fully succeeds or fails.

Implement retry logic: on failure, retry N times, then fall back to single‑row inserts and log failures.

Insert rows in primary‑key order; avoid non‑primary indexes during bulk load, or create them after the load.

Storage Engine Choice

MyISAM offers higher raw insert speed but lacks transactional safety. InnoDB provides ACID guarantees. Benchmarks (see chart) show MyISAM is faster, but disabling InnoDB’s innodb_flush_log_at_trx_commit (set to 0 or 2) narrows the gap. Default to InnoDB unless the cluster forbids changing the flush policy.

File‑Reading Performance

Reading a 10 GB file in one shot is impossible. Five Java approaches were benchmarked on macOS (3.4 GB file): Files.readAllBytes – OOM FileReader+BufferedReader – 11 s File+BufferedReader – 10 s Scanner – 57 s Java NIO FileChannel (buffered) – 3 s

Although NIO is fastest, it reads fixed‑size buffers and cannot guarantee line boundaries, making BufferedReader the practical choice (≈30 s for a 10 GB file) because the overall bottleneck is the database write.

Ensuring Ordered Writes

Each file’s name and line number form a unique key (e.g., index_90.txt → database_9.table_0). By routing each file to a dedicated table, per‑table order is preserved, and global order is achieved via the file‑index prefix.

Task Coordination and Reliability

Import tasks are represented in a task table with fields: bizId – business line identifier databaseIndex – assigned database suffix tableIndex – assigned table suffix parentTaskId – overall job identifier offset – current progress offset status – execution state (e.g., Pending, Processing, Done)

Workers poll the table, claim a task using a Redisson semaphore (permit = 1), process its batch, update offset in Redis via INCRBY, and release the semaphore. If a worker crashes, the semaphore may be orphaned; a master node (selected via Zookeeper + Curator) re‑assigns abandoned tasks.

Distributed Lock & Master Election

The master periodically scans for pending tasks. When the number of active tasks is below the configured concurrency limit, it picks the smallest‑id pending task, marks it Processing, and publishes a message. Workers consume the message, acquire a Redisson lock with lease renewal, execute the batch, then release the lock.

If concurrency equals the limit, the master checks for lost locks and re‑queues those tasks. If concurrency exceeds the limit, an alarm is raised for manual intervention.

Final Recommendations

Confirm file size, split count, and ordering constraints before designing.

Split data into multiple databases/tables based on B+‑tree capacity (≈10 M rows per table).

Use batch inserts with a dynamic batch size; tune the threshold experimentally.

Prefer InnoDB with innodb_flush_log_at_trx_commit set to 0 or 2; benchmark MyISAM if policy changes are prohibited.

Read files with BufferedReader for line‑wise processing.

Combine read‑and‑write in the same worker to avoid cross‑task ordering issues.

Track progress in Redis; on restart, resume from the stored offset to guarantee idempotency.

Coordinate workers via a Zookeeper‑elected master and Redisson locks to enforce exclusive writes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Redis mysql Batch Insert redisson data import Distributed Tasks

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.