Sorting a 4.6GB File of 500M Numbers: Internal, Merge, Bitmap & External Techniques
This article explores how to sort a massive 4.6 GB file containing 500 million random integers by applying internal quicksort with median‑of‑three, merge sort, a bitmap‑based approach, and an external‑merge strategy, comparing their performance, memory usage, and implementation details in Java.
1. Problem
You are given a file bigdata of size 4663 MB containing 500 million random integers, one per line, and need to sort it.
2. Internal Sorting
Two internal sorting strategies are tried: a quicksort variant with a cutoff to insertion sort and a three‑way partition (median‑of‑three) to improve pivot selection.
private final int cutoff = 8;
public void perform(Comparable[] a) {
perform(a, 0, a.length - 1);
}
private void perform(Comparable[] a, int low, int high) {
int n = high - low + 1;
if (n <= cutoff) {
InsertionSort insertionSort = SortFactory.createInsertionSort();
insertionSort.perform(a, low, high);
} else if (n <= 100) {
int m = median3(a, low, low + (n >>> 1), high);
exchange(a, m, low);
} else {
int gap = n >>> 3;
// further partitioning logic omitted for brevity
}
// recursive calls
perform(a, low, lt - 1);
perform(a, gt + 1, high);
}3. Merge Sort
A classic top‑down merge sort is implemented, delegating to insertion sort for small sub‑arrays (cutoff = 8).
@Override
public void perform(Comparable[] a) {
Comparable[] b = a.clone();
perform(b, a, 0, a.length - 1);
}
private void perform(Comparable[] src, Comparable[] dest, int low, int high) {
if (low >= high) return;
if (high - low <= cutoff) {
SortFactory.createInsertionSort().perform(dest, low, high);
return;
}
int mid = low + ((high - low) >>> 1);
perform(dest, src, low, mid);
perform(dest, src, mid + 1, high);
if (lessThanOrEqual(src[mid], src[mid + 1])) {
System.arraycopy(src, low, dest, low, high - low + 1);
return;
}
merge(src, dest, low, mid, high);
}4. Using the Unix sort Command
Running the built‑in sort utility on the file took about 24 minutes, mainly due to heavy I/O, memory pressure, and frequent context switches.
5. Bitmap Method
A bitmap (BitSet) is used to record the presence of each integer, allowing linear‑time sorting when the value range is known.
private BitSet bits;
public void perform(String largeFileName, int total, String destLargeFileName,
int readerBufferSize, int writerBufferSize, boolean asc) throws IOException {
long start = System.currentTimeMillis();
bits = new BitSet(total);
InputPart<Integer> largeIn = PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
OutputPart<Integer> largeOut = PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
largeOut.delete();
Integer data;
int off = 0;
while (true) {
data = largeIn.read();
if (data == null) break;
set(data);
off++;
}
largeIn.close();
int size = bits.size();
System.out.println(String.format("lines : %d ,bits : %d", off, size));
if (asc) {
for (int i = 0; i < size; i++) {
if (get(i)) largeOut.write(i);
}
} else {
for (int i = size - 1; i >= 0; i--) {
if (get(i)) largeOut.write(i);
}
}
largeOut.close();
long elapsed = System.currentTimeMillis() - start;
System.out.println(String.format("BitmapSort Completed.elapsed : %dms", elapsed));
}
private void set(int i) { bits.set(i); }
private boolean get(int i) { return bits.get(i); }The bitmap sort completed in 190 seconds (≈3 minutes) with most time spent on I/O.
6. External Sorting
When memory is insufficient, an external‑merge sort is applied: the large file is split into many sorted chunks that fit into a small in‑memory buffer, then these chunks are merged using a multi‑way merge.
The external sort finished in about 771 seconds (≈13 minutes).
Overall, the article demonstrates practical sorting techniques for massive datasets, evaluates their trade‑offs, and provides Java implementations for each method.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
