Big Data 11 min read

How to Sort a 4.6 GB File with 500 Million Numbers Efficiently

This article explores practical techniques for sorting a massive 4.6 GB file containing 500 million random integers, comparing internal quick‑sort and merge‑sort implementations, a bitmap‑based approach, and a full external‑sorting pipeline, while analyzing performance bottlenecks and memory constraints.

Programmer DD
Programmer DD
Programmer DD
How to Sort a 4.6 GB File with 500 Million Numbers Efficiently

Problem

Given a file named bigdata of size 4663 MB containing 500 million random integers (one per line), the task is to sort the file.

Internal Sorting Attempts

Two internal sorting strategies were tried. The first is a quick‑sort implementation that uses a cutoff of 8 for insertion sort, median‑of‑three, and a ninther pivot selection for larger partitions.

private final int cutoff = 8;

public void perform(Comparable[] a) {
    perform(a, 0, a.length - 1);
}

private int median3(Comparable[] a, int x, int y, int z) {
    if (lessThan(a[x], a[y])) {
        if (lessThan(a[y], a[z])) {
            return y;
        } else if (lessThan(a[x], a[z])) {
            return z;
        } else {
            return x;
        }
    } else {
        if (lessThan(a[z], a[y])) {
            return y;
        } else if (lessThan(a[z], a[x])) {
            return z;
        } else {
            return x;
        }
    }
}

private void perform(Comparable[] a, int low, int high) {
    int n = high - low + 1;
    if (n <= cutoff) {
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a, low, high);
    } else if (n <= 100) {
        int m = median3(a, low, low + (n >>> 1), high);
        exchange(a, m, low);
    } else {
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a, low, low + gap, low + (gap << 1));
        int m2 = median3(a, m - gap, m, m + gap);
        int m3 = median3(a, high - (gap << 1), high - gap, high);
        int ninther = median3(a, m1, m2, m3);
        exchange(a, ninther, low);
    }
    if (high <= low) return;
    int lt = low;
    int gt = high;
    Comparable pivot = a[low];
    int i = low + 1;
    while (i <= gt) {
        if (lessThan(a[i], pivot)) {
            exchange(a, lt++, i++);
        } else if (lessThan(pivot, a[i])) {
            exchange(a, i, gt--);
        } else {
            i++;
        }
    }
    perform(a, low, lt - 1);
    perform(a, gt + 1, high);
}

The second is a merge‑sort implementation that falls back to insertion sort for small sub‑arrays.

private final int cutoff = 8;

@Override
public void perform(Comparable[] a) {
    Comparable[] b = a.clone();
    perform(b, a, 0, a.length - 1);
}

private void perform(Comparable[] src, Comparable[] dest, int low, int high) {
    if (low >= high) return;
    if (high - low <= cutoff) {
        SortFactory.createInsertionSort().perform(dest, low, high);
        return;
    }
    int mid = low + ((high - low) >>> 1);
    perform(dest, src, low, mid);
    perform(dest, src, mid + 1, high);
    if (lessThanOrEqual(src[mid], src[mid + 1])) {
        System.arraycopy(src, low, dest, low, high - low + 1);
        return;
    }
    merge(src, dest, low, mid, high);
}

private void merge(Comparable[] src, Comparable[] dest, int low, int mid, int high) {
    for (int i = low, v = low, w = mid + 1; i <= high; i++) {
        if (w > high || v <= mid && lessThanOrEqual(src[v], src[w])) {
            dest[i] = src[v++];
        } else {
            dest[i] = src[w++];
        }
    }
}

Both approaches suffer from deep recursion, high stack usage, and excessive memory copying, making them unsuitable for the full 500 million‑record file.

Using the System sort Command

Running the Unix sort utility on the file took about 24 minutes. The slowdown is attributed to limited memory, heavy I/O, frequent context switches, and cache pressure.

Bitmap Sort

A bitmap‑based algorithm was implemented using java.util.BitSet. It reads each integer, sets the corresponding bit, and then writes the numbers back in order (or reverse order). This method sorted the file in 190 seconds (≈3 minutes) but remains I/O‑bound.

private BitSet bits;

public void perform(String largeFileName, int total, String destLargeFileName,
                    Castor<Integer> castor, int readerBufferSize,
                    int writerBufferSize, boolean asc) throws IOException {
    System.out.println("BitmapSort Started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn = PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut = PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();
    Integer data;
    int off = 0;
    try {
        while (true) {
            data = largeIn.read();
            if (data == null) break;
            int v = data;
            set(v);
            off++;
        }
        largeIn.close();
        int size = bits.size();
        System.out.println(String.format("lines : %d ,bits : %d", off, size));
        if (asc) {
            for (int i = 0; i < size; i++) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        } else {
            for (int i = size - 1; i >= 0; i--) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        }
        largeOut.close();
        long stop = System.currentTimeMillis();
        long elapsed = stop - start;
        System.out.println(String.format("BitmapSort Completed.elapsed : %dms", elapsed));
    } finally {
        largeIn.close();
        largeOut.close();
    }
}

private void set(int i) {
    bits.set(i);
}

private boolean get(int v) {
    return bits.get(v);
}

External Sorting for Low‑Memory Environments

When memory is scarce, external sorting divides the file into small chunks, sorts each chunk in memory, and writes the sorted chunks to disk. Afterwards a multi‑way merge combines the chunks into a single sorted file.

Splitting phase : read the large file line‑by‑line into a small in‑memory buffer, sort the buffer, and write the sorted segment to a temporary file.

Merging phase : repeatedly select the smallest current element among the heads of all sorted chunks and write it to the output file.

Using this approach the 4.6 GB file was sorted in 771 seconds (≈13 minutes).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

merge sortbitmap sortexternal sorting
Programmer DD
Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.