
Designing an Efficient Pipeline for Importing One Billion Records into MySQL

This article presents a comprehensive engineering guide for importing one billion 1 KB unstructured log records stored in HDFS or S3 into MySQL, covering data sizing, B‑tree limits, batch insertion strategies, storage‑engine choices, sharding, file‑reading techniques, concurrency control, and reliable task coordination using Redis, Redisson, and Zookeeper.

To import one billion 1 KB log records from HDFS or S3 into MySQL, first pin down the constraints: each record is 1 KB, the data is unstructured, it is stored in 100 split files, the records must be imported in order, and the target database is MySQL.

Single‑table feasibility: MySQL single tables are not recommended beyond ~20 million rows for this workload, because growing the B+‑tree to a fourth level adds an extra page read per access and degrades insert performance. With 16 KB leaf pages and 1 KB records, each leaf holds 16 rows, and a non‑leaf page can reference roughly 1,170 child pages, so a three‑level tree tops out at about 1,170 × 1,170 × 16 ≈ 21.9 million rows.
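The ~20 million figure can be reproduced with back‑of‑envelope arithmetic. The sketch below assumes an 8‑byte bigint key plus a 6‑byte child‑page pointer per non‑leaf entry — typical InnoDB figures, not stated in the article:

```java
// Back-of-envelope B+-tree capacity for InnoDB's 16 KB pages.
// Assumptions: 8-byte bigint key + 6-byte page pointer per non-leaf entry,
// and 1 KB rows so one 16 KB leaf page holds 16 rows.
public class BTreeCapacity {
    static final int PAGE_SIZE = 16 * 1024;
    static final int NON_LEAF_ENTRY = 8 + 6;   // key + child-page pointer
    static final int ROWS_PER_LEAF = 16;       // 16 KB / 1 KB rows

    // Max rows for a tree of the given height (height 1 = a single leaf page).
    static long maxRows(int height) {
        long fanOut = PAGE_SIZE / NON_LEAF_ENTRY; // 1170 children per non-leaf page
        long leaves = 1;
        for (int i = 1; i < height; i++) leaves *= fanOut;
        return leaves * ROWS_PER_LEAF;
    }

    public static void main(String[] args) {
        System.out.println("fan-out: " + (PAGE_SIZE / NON_LEAF_ENTRY)); // 1170
        System.out.println("3-level capacity: " + maxRows(3));          // 21902400 (~21.9 M rows)
    }
}
```

A fourth level would multiply capacity by another ~1,170, but every lookup and insert then touches one more page, which is where the practical ~20 million guideline comes from.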

Batch insertion: single‑row inserts are slow, so group rows into batches and tune the batch size (e.g., 100 rows). InnoDB transactions make each batch atomic, and failed batches should be retried. Prefer inserting rows in primary‑key order and avoid non‑primary indexes during the bulk load; create them after the load if needed.
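As a sketch of what a batched insert looks like at the SQL level — this multi‑row VALUES form is essentially what MySQL Connector/J's `rewriteBatchedStatements` option produces from a JDBC batch — the helper below builds one statement per batch; the table and column names are hypothetical:

```java
// Builds a single multi-row INSERT for a batch, e.g. for batchSize = 3:
// INSERT INTO log_0 (id, payload) VALUES (?, ?), (?, ?), (?, ?)
// Table name "log_0" and columns (id, payload) are illustrative placeholders.
public class BatchSql {
    static String buildInsertSql(String table, int batchSize) {
        StringBuilder sb = new StringBuilder("INSERT INTO " + table + " (id, payload) VALUES ");
        for (int i = 0; i < batchSize; i++) {
            sb.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildInsertSql("log_0", 3));
    }
}
```

In practice the parameters would be bound via a `PreparedStatement` and the whole batch committed in one transaction, so a failure rolls back and the batch can be retried as a unit.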

Storage‑engine selection: MyISAM offers higher raw insert speed but lacks transaction support, while InnoDB provides safety and comparable performance when innodb_flush_log_at_trx_commit is set to 0 or 2. The article recommends InnoDB unless the cluster disallows changing the flush policy.
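If the DBA allows it, the relaxed‑durability setting can be expressed as a my.cnf fragment for the duration of the bulk load; restore the default of 1 afterwards, since 0/2 can lose up to a second of commits on a crash:

```ini
# my.cnf — bulk-load only; revert to innodb_flush_log_at_trx_commit = 1 after the import
[mysqld]
innodb_flush_log_at_trx_commit = 2   # write redo log at commit, fsync roughly once per second
sync_binlog = 0                      # let the OS decide when to flush the binary log
```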

Sharding considerations: a single MySQL instance caps at ~5 K TPS, and SSDs handle concurrent writes far better than HDDs. The design should therefore make the number of databases and concurrently written tables configurable, supporting both HDD and SSD environments.
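The ~5 K TPS ceiling translates directly into wall‑clock time, which is why sharding matters. A purely illustrative calculation, treating each instance as sustaining 5,000 inserted rows per second:

```java
// Rough load-time estimate: rows / (instances × per-instance insert rate).
// The 5,000 rows/sec figure is the article's single-instance ceiling;
// everything else here is illustrative arithmetic.
public class ShardEstimate {
    static long secondsToLoad(long rows, int instances, int rowsPerSecPerInstance) {
        return rows / ((long) instances * rowsPerSecPerInstance);
    }

    public static void main(String[] args) {
        long one  = secondsToLoad(1_000_000_000L, 1, 5_000); // 200000 s ≈ 55.6 hours
        long four = secondsToLoad(1_000_000_000L, 4, 5_000); //  50000 s ≈ 13.9 hours
        System.out.println(one + " s with 1 instance vs " + four + " s with 4");
    }
}
```

One instance needs over two days for a billion rows; spreading the load across four instances brings it under 14 hours, assuming the readers can keep all shards busy.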

File‑reading strategies: for a 10 GB file, line‑by‑line reading with BufferedReader is sufficient (≈30 s) because the overall bottleneck is DB write speed. Java NIO's FileChannel with ByteBuffer is faster for raw reads but requires manual line splitting.
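A minimal sketch of the BufferedReader approach, grouping lines into batches for the DB writer; the batch size and the `sink` destination are illustrative (in the real pipeline the sink would be the batched insert path):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Line-by-line reading with a bounded batch buffer. Each full batch goes to
// `sink`; for an actual split file the reader would come from
// Files.newBufferedReader(path). Returns the total number of lines read.
public class LineReader {
    static int readInBatches(BufferedReader reader, int batchSize, List<List<String>> sink) {
        int lines = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                lines++;
                if (batch.size() == batchSize) {
                    sink.add(batch);                    // hand a full batch to the writer
                    batch = new ArrayList<>(batchSize);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!batch.isEmpty()) sink.add(batch);          // flush the final partial batch
        return lines;
    }
}
```

Reading line by line also preserves file order, which matters given the ordered‑import requirement.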

Task coordination: give each record a unique primary key composed of task ID, file index, and line number to ensure idempotency. Track progress in Redis using INCRBY on a task‑specific key, and retry both the DB insert and the Redis update on failure.
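A sketch of the deterministic key scheme: because the key is derived purely from task ID, file index, and line number, replaying a failed batch regenerates identical keys, so duplicates can be rejected or skipped (e.g., with MySQL's INSERT IGNORE). The key format shown is an assumption, not prescribed by the article:

```java
// Deterministic primary key "taskId-fileIndex-lineNumber": re-running a batch
// after a crash produces the same keys, making the insert path idempotent.
public class RecordKey {
    static String of(long taskId, int fileIndex, long lineNumber) {
        return taskId + "-" + fileIndex + "-" + lineNumber;
    }

    public static void main(String[] args) {
        System.out.println(of(7, 3, 120001)); // "7-3-120001"
    }
}
```

The Redis side then only needs INCRBY on a per‑task counter after each successful batch, so a restarted worker knows which line to resume from.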

Concurrency control: writing from 100 parallel readers straight to the DB would overload the storage engine. Instead, limit concurrency using Redisson semaphores (one set of permits per database) or a master‑node scheduler that assigns tasks, holds distributed locks, and renews them via Zookeeper + Curator to avoid lock‑expiration issues.
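The per‑database permit idea can be sketched locally with `java.util.concurrent.Semaphore` standing in for Redisson's `RSemaphore` — in the distributed version the permits live in Redis and are shared across importer JVMs, but the acquire/release discipline is the same:

```java
import java.util.concurrent.Semaphore;

// Local stand-in for a Redisson RSemaphore: at most `permitsPerDb` writer
// threads may target one database at a time, no matter how many readers exist.
public class DbWriteLimiter {
    private final Semaphore permits;

    DbWriteLimiter(int permitsPerDb) {
        permits = new Semaphore(permitsPerDb);
    }

    // Blocks while the database already has the maximum number of writers.
    void write(Runnable batchInsert) {
        permits.acquireUninterruptibly();
        try {
            batchInsert.run();   // perform the batched insert
        } finally {
            permits.release();   // always free the slot, even on failure
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With one limiter per shard, 100 readers can all run, but each database only ever sees a bounded number of concurrent write connections.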

Overall recommendation: combine reading and writing in the same task, tune batch sizes (e.g., 100, 1,000, 10,000), benchmark InnoDB vs. MyISAM, adjust sharding granularity, and use Redis for progress tracking and Zookeeper for leader election to achieve a reliable, high‑throughput import.

Performance Optimization · Redis · task scheduling · Zookeeper · MySQL · Data Sharding · batch insert · Redisson
Written by

Architect's Guide

Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
