External Sorting of a 4.6 GB File Containing 500 Million Integers: Strategies, Implementations, and Performance
The article presents a practical case of sorting a 4.6 GB file with 500 million random integers, evaluates in‑memory quicksort and merge‑sort implementations, discusses bitmap sorting, and finally details a multi‑phase external‑sort algorithm with measured runtimes and resource considerations.
The problem is to sort a single file named bigdata (size 4663 MB, containing 5 × 10⁸ random integers, one per line).
Internal Sorting Attempts
Two in‑memory algorithms were tried.
Three‑Way Quicksort
private final int cutoff = 8;
public
void perform(Comparable
[] a) {
perform(a, 0, a.length - 1);
}
/* median‑of‑three, insertion‑sort fallback, ninther pivot selection, partition loop ... */The implementation switches to insertion sort for very small sub‑arrays and uses median‑of‑three or ninther pivot selection depending on the size.
Merge Sort
private final int cutoff = 8;
@Override
public
void perform(Comparable
[] a) {
Comparable
[] b = a.clone();
perform(b, a, 0, a.length - 1);
}
private
void perform(Comparable
[] src, Comparable
[] dest, int low, int high) {
if (low >= high) return;
if (high - low <= cutoff) {
SortFactory.createInsertionSort().perform(dest, low, high);
return;
}
int mid = low + ((high - low) >>> 1);
perform(dest, src, low, mid);
perform(dest, src, mid + 1, high);
if (lessThanOrEqual(src[mid], src[mid + 1])) {
System.arraycopy(src, low, dest, low, high - low + 1);
return;
}
merge(src, dest, low, mid, high);
}
private
void merge(Comparable
[] src, Comparable
[] dest, int low, int mid, int high) {
for (int i = low, v = low, w = mid + 1; i <= high; i++) {
if (w > high || (v <= mid && lessThanOrEqual(src[v], src[w]))) {
dest[i] = src[v++];
} else {
dest[i] = src[w++];
}
}
}Both approaches suffered from deep recursion and excessive memory copying, leading to stack overflow or out‑of‑memory errors on a file of this magnitude.
Using the System sort Command
sort -n bigdata -o bigdata.sortedThe built‑in Unix sort completed in about 24 minutes, but the author questioned the speed given the hardware constraints.
Bitmap Sort
private BitSet bits;
public void perform(String largeFileName, int total, String destLargeFileName,
Castor
castor, int readerBufferSize, int writerBufferSize, boolean asc) throws IOException {
System.out.println("BitmapSort Started.");
long start = System.currentTimeMillis();
bits = new BitSet(total);
// read input, set bits
// write output in ascending or descending order
long stop = System.currentTimeMillis();
System.out.println(String.format("BitmapSort Completed.elapsed : %dms", stop - start));
}
private void set(int i) { bits.set(i); }
private boolean get(int v) { return bits.get(v); }This method finished in 190 seconds (≈3 minutes) using roughly 4663 MB / 32 ≈ 146 MB of heap, but it relies on the ability to allocate a BitSet large enough to hold all possible values.
External Sort
When memory is insufficient, a classic external‑sort (divide‑and‑conquer) is employed:
1. Split Phase
A small in‑memory buffer ( memBuffer ) is filled from the source file. Once full, the buffer is sorted using an internal algorithm and written to a temporary sorted chunk file ( bigdata.xxx.part.sorted ). This repeats until the whole file is partitioned into n sorted runs.
2. Merge Phase
The n sorted runs are merged using a multi‑way merge: the smallest current element among all runs is selected (using a min‑heap or simple linear scan) and written to the final output, then the next element from that run is considered. The process continues until all runs are exhausted.
Example with three runs:
File1: 3,6,9
File2: 2,4,8
File3: 1,5,7
Round 1: pick min(1,2,3)=1 → output
Round 2: pick min(3,2,5)=2 → output
... and so on.The external sort completed in 771 seconds (≈13 minutes), producing a correctly ordered file.
Observations
In‑memory sorting of such a massive dataset is impractical due to stack depth and heap pressure.
System sort is reliable but not the fastest for this workload.
Bitmap sort is extremely fast when the value range fits into memory.
External sort provides a scalable solution for limited‑memory environments.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.