How Flink’s Sort‑Shuffle Boosts Large‑Scale Batch Processing Performance
This article explains how Flink’s new Sort‑Shuffle mechanism improves stability and performance for massive batch jobs by reducing file counts, optimizing I/O, minimizing memory usage, and providing detailed implementation, test results, tuning tips, and future enhancements.
1. Data Shuffle Overview
Data shuffle is a crucial stage in batch jobs where upstream operators write output to external storage and downstream operators read it for further processing, also serving as a recovery point.
2. Why Introduce Sort‑Shuffle
The previous hash‑based shuffle generated many files, causing high file‑system pressure, random I/O, and memory‑buffer consumption that limited stability and performance. Sort‑Shuffle reduces the number of files, opens fewer files, maximizes sequential reads/writes, lowers I/O amplification, and decouples memory buffers from parallelism.
3. Flink Sort‑Shuffle Implementation
3.1 Design Goals
Reduce file count
Open fewer files
Maximize sequential I/O
Reduce I/O amplification
Lower memory‑buffer consumption
3.2 Implementation Details
3.2.1 In‑memory Data Sorting
During the sort‑spill phase, each record is serialized and written to a sort buffer. When the buffer fills, all binary data are sorted by partition order and then flushed to a file. The buffer interface is:
public interface SortBuffer {
/** Appends data of the specified channel to this SortBuffer. */
boolean append(ByteBuffer source, int targetChannel, Buffer.DataType dataType) throws IOException;
BufferWithChannel copyIntoSegment(MemorySegment target);
long numRecords();
long numBytes();
boolean hasRemaining();
void finish();
boolean isFinished();
void release();
boolean isReleased();
}3.2.2 File Storage Structure
Each parallel task writes its shuffle data to a single physical file composed of multiple data regions. Within each region, records are ordered by partition number, enabling downstream tasks to read data sequentially per partition.
3.2.3 Index Files
For each data file there is an index file containing offsets and lengths for every partition in each data region, cached up to 4 MiB for fast lookup.
3.2.4 I/O Scheduling
Flink schedules reads using an elevator‑like algorithm that always serves the smallest file offset first, reducing random seeks.
for (data_region in data_regions) {
data_reader = poll_reader_of_the_smallest_file_offset(data_readers);
if (data_reader == null) break;
reading_buffers = request_reading_buffers();
if (reading_buffers.isEmpty()) break;
read_data(data_region, data_reader, reading_buffers);
}3.2.5 Broadcast Optimization
Broadcast data is stored only once in both the sort buffer and the shuffle file; all partition indexes point to the same block, greatly reducing broadcast overhead.
3.2.6 Data Compression
Compression is applied per network buffer per partition, improving TPC‑DS performance by over 30 %.
4. Test Results
4.1 Stability
Sort‑Shuffle eliminates file‑handle and inode exhaustion issues and resolves known Flink bugs (e.g., FLINK‑21201, FLINK‑19925), greatly improving batch job stability.
4.2 Performance
On a 1000‑parallelism cluster running TPC‑DS 10 TB, Sort‑Shuffle achieved 2‑6× overall speedup and up to 10× improvement for the shuffle phase alone.
5. Tuning Parameters
To enable Sort‑Shuffle set taskmanager.network.sort-shuffle.min-parallelism to a value lower than the expected number of partitions (e.g., 1 on HDD). Enable compression with taskmanager.network.blocking-shuffle.compression.enabled. Adjust buffer sizes via taskmanager.network.sort-shuffle.min-buffers to balance memory usage and throughput.
6. Future Work
Network connection reuse (FLINK‑22643, FLINK‑15455)
Multi‑disk load balancing (FLINK‑21790, FLINK‑21789)
Remote shuffle service
User‑selectable disk types (HDD vs SSD)
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
