
How StarRocks Redesigns Bulk Import to Cut Small Files and Boost Throughput

This article explains how StarRocks mitigates the hidden risks of massive one‑time data imports in a storage‑compute separated architecture by redesigning the write path to spill to local disk, merge centrally, and write to object storage, resulting in fewer small files, higher write throughput, and more stable query performance.


Background

When users migrate large volumes of historical data to StarRocks in a storage‑compute separated architecture that relies on object storage, importing massive datasets in a single batch can cause severe performance degradation, a surge of tiny files, and reduced query speed.

Challenges in Storage‑Compute Separation

Historical data spans many tablets; under high‑concurrency import, memtables flush frequently, generating a flood of small files.

Users often start with a few small compute nodes (CNs), so limited CPU and memory exacerbate the “small memtable, frequent flush, tiny‑file accumulation” problem.

After bulk import, users may shrink the cluster, leaving many small files in object storage that are not promptly merged.

When queries run on these numerous small files, scan overhead dramatically reduces query performance.

Redesigning the Write Path

Write Stage – Spill to Local Disk First

When a memtable fills, data is spilled to the CN’s local disk instead of being written directly to object storage. This avoids high‑latency remote I/O and prevents repeated sorting, encoding, and indexing work on partially formed data. If local disk space runs out, overflow data can be selectively written to S3.
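The spill-first idea can be sketched as follows. This is a minimal illustration, not StarRocks code: the `Memtable` class, the flush threshold, and the run-file format are all assumptions chosen to show the shape of the technique, i.e. sorting once and flushing full buffers to cheap local files while deferring any remote write.

```python
import os
import tempfile

# Hypothetical sketch of "spill to local disk first": when an in-memory
# buffer (memtable) reaches its size limit, its sorted contents are flushed
# to a local temp file instead of object storage. Names and limits here are
# illustrative, not StarRocks APIs.
class Memtable:
    def __init__(self, limit_bytes, spill_dir):
        self.limit = limit_bytes
        self.spill_dir = spill_dir
        self.rows = []
        self.size = 0
        self.spill_files = []

    def insert(self, key, value):
        self.rows.append((key, value))
        self.size += len(key) + len(value)
        if self.size >= self.limit:
            self.spill()

    def spill(self):
        # Sort once and write a local run file; remote S3 writes are deferred
        # until the converge stage merges these runs into target-sized files.
        self.rows.sort()
        fd, path = tempfile.mkstemp(dir=self.spill_dir, suffix=".run")
        with os.fdopen(fd, "w") as f:
            for k, v in self.rows:
                f.write(f"{k}\t{v}\n")
        self.spill_files.append(path)
        self.rows, self.size = [], 0

spill_dir = tempfile.mkdtemp()
mt = Memtable(limit_bytes=64, spill_dir=spill_dir)
for i in range(10):
    mt.insert(f"key{i:03d}", "x" * 10)
print(len(mt.spill_files), "local run files, none on S3 yet")
```

The key property is that a small, memory-constrained CN produces many small *local* run files, which is cheap, instead of many small *remote* objects, which is what degrades later queries.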

Converge Stage – Centralized Merge Before Object‑Storage Write

After all data for a bulk‑import job is written locally, the temporary spill files are merged into well‑structured, appropriately sized target files, which are then written to object storage in one batch.

The new bulk‑import flow is: Memory → Local Disk Spill → Centralized Merge → Object Storage.
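The converge stage can be sketched as a k-way merge. Again this is an illustration under assumed names and formats, not the actual implementation: many small sorted spill files ("runs") are streamed through a merge and cut into a few target files of a chosen size, which would then be uploaded to object storage in one batch.

```python
import heapq
import os
import tempfile

# Illustrative converge stage: k-way merge many small sorted run files into
# a few appropriately sized target files. File layout, suffixes, and the
# size threshold are assumptions for the sketch.
def write_run(dir_, rows):
    fd, path = tempfile.mkstemp(dir=dir_, suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{k}\t{v}\n" for k, v in sorted(rows))
    return path

def flush_segment(out_dir, lines):
    fd, path = tempfile.mkstemp(dir=out_dir, suffix=".seg")
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    return path

def merge_runs(run_paths, out_dir, target_bytes):
    """K-way merge sorted runs into target files of roughly target_bytes."""
    files = [open(p) for p in run_paths]
    outputs, buf, size = [], [], 0
    for line in heapq.merge(*files):      # streams lines in global sort order
        buf.append(line)
        size += len(line)
        if size >= target_bytes:          # cut a well-sized target file
            outputs.append(flush_segment(out_dir, buf))
            buf, size = [], 0
    if buf:
        outputs.append(flush_segment(out_dir, buf))
    for f in files:
        f.close()
    return outputs                        # uploaded to S3 in one batch

tmp = tempfile.mkdtemp()
runs = [write_run(tmp, [(f"k{i:02d}{j}", "v") for i in range(5)]) for j in range(4)]
merged = merge_runs(runs, tmp, target_bytes=10**6)
print(len(runs), "small runs merged into", len(merged), "target file(s)")
```

Because the merge sees all of a job's data at once, it can size output files deliberately rather than letting file size fall out of memtable pressure, which is exactly what eliminates the small-file flood on S3.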

Performance Evaluation

Test 1: Single‑Threaded 1 TB Load

Using Broker Load to import 1 TB (≈270 M rows) in a single task, the original path took about 2 h 15 min for the import stage and an additional 34 min for background compaction, totaling ~2 h 50 min before the system became query‑ready.

*************************** 3. row **************************
JobId: 10409
State: FINISHED
Type: BROKER
SinkRows: 270000000
LoadStartTime: 2024-12-27 10:59:12
LoadFinishTime: 2024-12-27 13:14:04

Compaction score after import: AvgCS ≈ 358, P50CS ≈ 299, MaxCS ≈ 1056.
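For readers unfamiliar with these metrics: AvgCS, P50CS, and MaxCS are the mean, median, and maximum of per-tablet compaction scores, where a higher score roughly means more file versions a tablet still needs to merge. The sample values below are made up purely to show how the aggregates relate; they are not the article's measurements.

```python
import statistics

# Hypothetical per-tablet compaction scores (illustrative values only).
scores = [299, 310, 358, 402, 1056, 120, 260]

print("AvgCS:", round(statistics.mean(scores), 2))
print("P50CS:", statistics.median(scores))
print("MaxCS:", max(scores))
```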

After the optimization, the total import time dropped to ~2 h 42 min, and the compaction score fell to near‑optimal values (AvgCS ≈ 2.39, P50CS ≈ 2, MaxCS ≈ 5), eliminating the need for a heavy background compaction pass.

*************************** 2. row **************************
JobId: 10642
State: FINISHED
Type: BROKER
SinkRows: 270000000
LoadStartTime: 2024-12-27 16:14:08
LoadFinishTime: 2024-12-27 18:56:00

Test 2: 10‑Concurrent 100 GB Loads (Total 1 TB)

In the pre‑optimization run, the cluster could not finish within the 4‑hour timeout and the jobs were cancelled.

*************************** 10. row **************************
JobId: 11458
State: CANCELLED
Type: BROKER
Priority: NORMAL
ScanRows: 21905408
LoadStartTime: 2025-01-06 17:11:46
LoadFinishTime: 2025-01-06 21:11:44

After applying the new write path, all ten tasks completed within 25 minutes of wall‑clock time, with each individual task finishing in about 17 minutes.
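A quick back-of-the-envelope calculation puts these figures in perspective, using only the numbers reported in the article (1 TB total across ten loads, 25 minutes after the optimization, a 4-hour timeout before it):

```python
# Aggregate ingest rate after the optimization: 1 TB in 25 minutes.
total_gb = 1024                               # 1 TB expressed in GB
wall_minutes = 25
throughput = total_gb / (wall_minutes * 60)   # GB per second, aggregate
print(f"aggregate ingest rate ≈ {throughput:.2f} GB/s")

# The pre-optimization run only gives an upper bound on its rate: the jobs
# were cancelled at the 4-hour timeout, so it was slower than this.
bound = total_gb / (240 * 60)
print(f"pre-optimization rate < {bound:.2f} GB/s")
```

In other words, the new path sustained roughly an order of magnitude more ingest bandwidth than the old path could manage before timing out.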

*************************** 20. row **************************
JobId: 28336
State: FINISHED
Type: BROKER
Priority: NORMAL
ScanRows: 30000000
LoadStartTime: 2025-01-06 20:10:49
LoadFinishTime: 2025-01-06 20:27:59

Compaction score remained stable at AvgCS ≈ 10, P50CS ≈ 10, MaxCS ≈ 10, indicating that the merged files were already well‑balanced.

Key Benefits

Writes to S3 are dramatically reduced, increasing write throughput and producing larger, more storage‑efficient objects.

The import process fully utilizes local‑disk I/O, delivering noticeable performance gains.

Conclusion

By reworking the bulk‑import pipeline at the kernel level, StarRocks effectively eliminates the proliferation of tiny files that arise under memory‑constrained conditions in a storage‑compute separated setup. Users can achieve higher efficiency and more stable performance with lower resource investment when loading massive historical datasets.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: data engineering, compaction, StarRocks, Storage Compute Separation, S3, Bulk Import
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
