Efficient Strategies for Importing One Billion Records into MySQL
This article explains how to import 1 billion 1 KB log records stored in HDFS or S3 into MySQL by analyzing single‑table limits, using batch inserts, choosing storage engines, sharding, optimizing file‑reading methods, and coordinating distributed tasks with Redis, Redisson, and Zookeeper to ensure ordered, reliable, and high‑throughput data loading.
Before designing a solution, clarify the constraints: 1 billion rows, each 1 KB, unstructured logs stored in HDFS or S3, split into 100 files, ordered import required, and the target database is MySQL.
Can a Single MySQL Table Hold 1 Billion Rows?
No. A single InnoDB table should be kept below roughly 20 million rows. The limit follows from B+‑tree index depth: a 3‑level index covers about 20 million 1 KB rows, while growing to a 4th level adds an extra page read per lookup and degrades performance.
Leaf pages are 16 KB, so each stores 16 rows (1 KB per row). Non‑leaf pages are also 16 KB and hold about 1,170 entries (14 bytes each: an 8‑byte bigint key plus a 6‑byte page pointer). This yields the following capacity table:
Level | Maximum Rows
2     | 18,720
3     | 21,902,400 (≈ 20 M)
4     | 25,625,808,000 (≈ 25.6 B)
Therefore, split the data into 100 tables (≈10 M rows each) to stay within safe limits.
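The capacity figures above follow from simple arithmetic. A minimal sketch, assuming the page and row sizes stated above (class and method names are illustrative):

```java
// Estimate how many rows an InnoDB B+-tree of a given height can index.
// Assumptions from the text: 16 KB pages, 1 KB rows, 14-byte non-leaf entries.
public class BTreeCapacity {
    static final long PAGE_BYTES = 16 * 1024;
    static final long ROW_BYTES = 1024;          // one 1 KB log record
    static final long NONLEAF_ENTRY_BYTES = 14;  // 8-byte bigint key + 6-byte page pointer

    static final long ROWS_PER_LEAF = PAGE_BYTES / ROW_BYTES;               // 16
    static final long POINTERS_PER_NODE = PAGE_BYTES / NONLEAF_ENTRY_BYTES; // 1170

    /** Maximum rows indexable by a tree of the given height (root included). */
    public static long maxRows(int levels) {
        long rows = ROWS_PER_LEAF;
        for (int i = 1; i < levels; i++) rows *= POINTERS_PER_NODE;
        return rows;
    }

    public static void main(String[] args) {
        for (int levels = 2; levels <= 4; levels++) {
            System.out.printf("%d levels -> %,d rows%n", levels, maxRows(levels));
        }
    }
}
```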
Efficient Writing to the Database
Single‑row inserts are too slow; use batch inserts (e.g., 100 rows per batch), which InnoDB executes atomically within a transaction. Enable retries: after N failed attempts on a batch, fall back to single‑row inserts and log the rows that still fail.
Write in primary‑key order for best performance; avoid non‑primary indexes during bulk load, or create them after the load.
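The batch-then-fallback logic can be sketched independently of JDBC. This is a minimal illustration, with the `BatchWriter` interface standing in for real batch execution (e.g., `PreparedStatement.executeBatch()`); all names here are hypothetical:

```java
import java.util.List;

// Sketch of the batch-insert strategy: try a whole batch up to N times,
// then fall back to single-row inserts and log rows that still fail.
public class BatchLoader {
    @FunctionalInterface
    public interface BatchWriter {
        void write(List<String> rows) throws Exception; // stand-in for a JDBC batch
    }

    /** Loads rows in batches; returns the number of rows successfully written. */
    public static int load(List<String> rows, int batchSize, int maxRetries, BatchWriter writer) {
        int written = 0;
        for (int from = 0; from < rows.size(); from += batchSize) {
            List<String> batch = rows.subList(from, Math.min(from + batchSize, rows.size()));
            if (tryWrite(batch, maxRetries, writer)) {
                written += batch.size();
            } else {
                // Fall back to single-row inserts so one bad row cannot sink the batch.
                for (String row : batch) {
                    if (tryWrite(List.of(row), maxRetries, writer)) written++;
                    else System.err.println("FAILED ROW (log for manual replay): " + row);
                }
            }
        }
        return written;
    }

    private static boolean tryWrite(List<String> batch, int maxRetries, BatchWriter writer) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try { writer.write(batch); return true; } catch (Exception ignored) { }
        }
        return false;
    }
}
```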
Concurrency on a Single Table
Concurrent writes to the same table would break ordering. Instead of parallelizing writes, raise throughput by increasing the batch size.
Choosing the MySQL Storage Engine
MyISAM offers higher raw insert speed but lacks transactions, making it unsuitable for reliable batch loads. InnoDB with innodb_flush_log_at_trx_commit set to 0 or 2 can achieve comparable performance while preserving ACID guarantees.
Should We Shard the Database?
Single‑instance MySQL typically caps at ~5 K TPS. SSDs handle concurrent writes to multiple tables well, whereas HDDs suffer from head contention when several tables are written simultaneously. Make the number of databases and the number of concurrently written tables configurable so they can be tuned to the underlying storage.
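Since the 100 source files map one-to-one onto 100 tables spread across several databases, routing can be a pure function of the file suffix. A sketch under that assumption (the `logs_db_N.logs_NN` naming is hypothetical, not from the article):

```java
// Route each of the 100 source files (suffix 00-99) to one of `dbCount`
// databases; each file owns exactly one table, preserving per-file order.
public class ShardRouter {
    private final int dbCount;

    public ShardRouter(int dbCount) {
        if (dbCount < 1 || dbCount > 100) throw new IllegalArgumentException("dbCount out of range");
        this.dbCount = dbCount;
    }

    /** Database index for a file suffix (0-99), spread round-robin. */
    public int dbIndex(int fileSuffix) {
        return fileSuffix % dbCount;
    }

    /** Fully qualified target table, e.g. suffix 53 with 10 dbs -> "logs_db_3.logs_53". */
    public String target(int fileSuffix) {
        return "logs_db_" + dbIndex(fileSuffix) + ".logs_" + String.format("%02d", fileSuffix);
    }
}
```

Keeping `dbCount` (and the per-database write concurrency) configurable is what lets the same loader run on SSD- or HDD-backed instances.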
Fast File Reading Techniques
A 10 GB file cannot be loaded into memory in one go. Benchmarks on macOS give:
Method                      | Result
Files.readAllBytes          | OOM
FileReader + BufferedReader | 11 s
File + BufferedReader       | 10 s
Scanner                     | 57 s
Java NIO FileChannel        | 3 s
Although NIO is fastest, it reads fixed‑size buffers that do not align with line boundaries. BufferedReader provides line‑wise reading at acceptable speed (on the order of 10 s for the 10 GB test above) and is sufficient, because the real bottleneck is the database writes.
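Line-wise reading in fixed-size batches is exactly what the loading pipeline needs. A minimal sketch with `BufferedReader` (class name and the sink callback are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Stream a large file line by line, handing fixed-size batches to a
// consumer (e.g. the database writer). Memory use stays at one batch.
public class BatchedLineReader {
    /** Reads the file in batches of `batchSize` lines; returns the total line count. */
    public static long readInBatches(Path file, int batchSize, Consumer<List<String>> sink)
            throws IOException {
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    sink.accept(List.copyOf(batch));
                    total += batch.size();
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {                    // flush the final partial batch
            sink.accept(List.copyOf(batch));
            total += batch.size();
        }
        return total;
    }
}
```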
Coordinating Read and Write Tasks
Running 100 fully parallel read‑and‑write tasks would overload the database. Instead, each task alternates: read one batch, write it, then read the next, which naturally caps the number of concurrent write streams. Kafka was considered for decoupling reads from writes but was dropped because of its ordering and partitioning complexities.
Ensuring Task Reliability
Use a composite primary key (file suffix + line number) to guarantee idempotent inserts. Record task progress in Redis with INCRBY on a per‑task key; on failure, retry the batch or fall back to single inserts, updating Redis after each successful batch.
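The composite key and progress counter can be sketched as follows. This is an illustration only: a `ConcurrentHashMap` stands in for Redis, and `advance` plays the role of `INCRBY` on a per-file key; all names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent progress tracking. The composite primary key
// (file suffix + line number) makes replayed inserts harmless, and a
// per-task counter records how far each file has been loaded. In
// production the counter would live in Redis, updated with INCRBY
// after each successfully committed batch.
public class ProgressTracker {
    private final Map<String, Long> progress = new ConcurrentHashMap<>();

    /** Composite primary key, e.g. file 07, line 123 -> "07_123". */
    public static String rowKey(int fileSuffix, long lineNumber) {
        return String.format("%02d_%d", fileSuffix, lineNumber);
    }

    /** Records `batchSize` more committed rows for this file (INCRBY analogue). */
    public long advance(int fileSuffix, long batchSize) {
        return progress.merge("import:progress:" + fileSuffix, batchSize, Long::sum);
    }

    /** Line number to resume from after a crash (0 if the task never ran). */
    public long resumeFrom(int fileSuffix) {
        return progress.getOrDefault("import:progress:" + fileSuffix, 0L);
    }
}
```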
Controlling Concurrency with Redisson
Redisson semaphores can limit the number of simultaneous write tasks, but they lack lease renewal. A workaround is to elect a master node (via Zookeeper+Curator) that assigns tasks, acquires a distributed lock with renewal, and releases it upon completion.
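The throttling effect of a distributed semaphore can be illustrated locally with the JDK's `java.util.concurrent.Semaphore`; Redisson's `RSemaphore` offers the same acquire/release pattern across JVMs. A single-process sketch (the 50 ms sleep simulates a batch write):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Local illustration of capping concurrent write tasks with a semaphore.
public class WriteThrottle {
    /** Runs `tasks` jobs, at most `permits` at a time; returns the peak concurrency observed. */
    public static int run(int tasks, int permits) throws InterruptedException {
        Semaphore semaphore = new Semaphore(permits);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                semaphore.acquire();   // blocks while `permits` tasks are already writing
                try {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(50);  // simulated batch write
                } finally {
                    running.decrementAndGet();
                    semaphore.release();
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }
}
```

What the JDK version cannot provide is exactly the gap the article points out: permits held by a crashed process are never returned, which is why the master-node design with a renewable distributed lock is preferred.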
Summary
Confirm constraints before designing the solution.
Split data into multiple databases/tables based on scale.
Analyze single‑database write bottlenecks to decide on sharding.
Adjust concurrent write degrees according to SSD/HDD characteristics.
Compare InnoDB and MyISAM storage engines in production.
Determine optimal batch‑insert size through testing.
Combine read and write tasks to avoid Kafka‑related ordering issues.
Persist task progress in Redis to achieve exactly‑once semantics.
Use a master node with Zookeeper+Curator for reliable task distribution and Redisson locks for exclusive execution.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn