High‑Performance Bulk Loading of Over 2 Billion Rows into MySQL Using XeLabs TokuDB
This article describes how to quickly import more than two billion rows from a big‑data source into MySQL by leveraging XeLabs TokuDB’s bulk‑loader, showing configuration, code examples, performance results, and practical recommendations for handling large‑scale data ingestion.
A friend needed to load over 2 billion rows received from a big‑data platform into MySQL for next‑day business reporting, requiring a fast ingestion method.
Typical InnoDB bulk inserts can reach 100‑150 k rows per second only when the dataset fits in memory; most real‑world cases exceed memory, so the author evaluated XeLabs TokuDB as an alternative.
XeLabs TokuDB (https://github.com/XeLabs/tokudb) adds several optimizations over the official version, including built‑in jemalloc memory allocation, extra performance metrics, Xtrabackup support, ZSTD compression, and the binlog_group_commit feature.
TokuDB core configuration:

loose_tokudb_cache_size=4G
loose_tokudb_directio=ON
loose_tokudb_fsync_log_period=1000
tokudb_commit_sync=0
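For reference, a minimal sketch of how these settings might look in my.cnf. The loose_ prefix simply lets mysqld start even if the TokuDB plugin is not yet loaded; the comments reflect the general meaning of each variable and should be checked against the TokuDB documentation for your build:

```
[mysqld]
# TokuDB in-memory cache (roughly analogous to innodb_buffer_pool_size)
loose_tokudb_cache_size = 4G
# Use Direct I/O and bypass the OS page cache
loose_tokudb_directio = ON
# Flush the TokuDB recovery log to disk every 1000 ms
loose_tokudb_fsync_log_period = 1000
# Do not fsync the log on every commit (trades durability for throughput)
tokudb_commit_sync = 0
```

With tokudb_commit_sync=0 and a 1000 ms fsync period, up to one second of committed transactions can be lost on a crash, which is usually an acceptable trade-off for a one-off bulk load.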
Table schema used for the test (comments translated from Chinese):

CREATE TABLE `user_summary` (
  `user_id` bigint(20) unsigned NOT NULL COMMENT 'user id / phone number',
  `weight` varchar(5) DEFAULT NULL COMMENT 'weight (kg)',
  `level` varchar(20) DEFAULT NULL COMMENT 'weight class',
  `beat_rate` varchar(12) DEFAULT NULL COMMENT 'percentage of users beaten',
  `level_num` int(10) DEFAULT NULL COMMENT 'number of users in the same weight class',
  UNIQUE KEY `u_user_id` (`user_id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;
Data was loaded with MySQL’s LOAD DATA INFILE command:
root@localhost [zst]>LOAD DATA INFILE '/u01/work/134-136.txt' \
INTO TABLE user_summary(user_id, weight, level, beat_rate, level_num);
Query OK, 200000000 rows affected (5 min 48.30 sec)
Records: 200000000 Deleted: 0 Skipped: 0 Warnings: 0

The resulting write speed works out to approximately 574k rows per second:
root@localhost [zst]>select 200000000/(5*60+48.30);
+------------------------+
| 200000000/(5*60+48.30) |
+------------------------+
| 574217.6285 |
+------------------------+

A file-size comparison showed the original 8.5 GB text file compressed to 3.5 GB in TokuDB (about 41% of the original size). In contrast, an equivalent InnoDB load produced files 3-4 times larger (35 GB and 176 GB) and took significantly longer.
Additional tests demonstrated that defining an auto-increment primary key disables TokuDB's bulk loader, increasing the load time for the same 200 million rows to over 22 minutes. The recommendation is to avoid auto-increment columns in tables that will be bulk loaded, or to enforce uniqueness with a unique secondary index instead, so the bulk loader stays active.
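To illustrate the recommendation, here is a hypothetical pair of schemas; the fast variant mirrors the user_summary table above, while the slow variant and its `id` column are illustrative, not from the original test:

```
-- Slow for bulk loads: an auto-increment primary key disables TokuDB's bulk loader
CREATE TABLE `user_summary_autoinc` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` bigint(20) unsigned NOT NULL,
  `weight` varchar(5) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;

-- Fast: uniqueness enforced by a unique secondary index keeps the bulk loader enabled
CREATE TABLE `user_summary_fast` (
  `user_id` bigint(20) unsigned NOT NULL,
  `weight` varchar(5) DEFAULT NULL,
  UNIQUE KEY `u_user_id` (`user_id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8;
```

If a surrogate key is genuinely needed downstream, it can be added after the bulk load completes rather than paid for during ingestion.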
Test environment: CentOS 7 VM with 8 CPU cores, 8 GB RAM, and a 500 GB high‑performance cloud disk; XeLabs TokuDB was compiled from source (binary available via Baidu Cloud).
Conclusion: In the described cloud environment, TokuDB can reliably achieve around 570 k rows per second for massive data loads, offering substantial speed and storage savings compared with InnoDB.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.