Big Data 10 min read

External Sorting of a 4.6 GB File with 500 Million Integers: Strategies and Implementation

This article explains how to sort a 4.6 GB file containing 500 million random integers using internal quicksort and merge sort attempts, the Unix sort command, a bitmap-based method, and a detailed external sorting strategy with multi‑way merge, discussing performance and resource constraints.

Architecture Digest
Architecture Digest
Architecture Digest
External Sorting of a 4.6 GB File with 500 Million Integers: Strategies and Implementation

The problem presents a 4.6 GB file named bigdata containing 500 million random integers, each on a separate line, and asks how to sort it.

Initial attempts at internal sorting include a three‑way quicksort implementation and a merge sort, both with cutoff thresholds for small sub‑arrays and using insertion sort for those cases. Code snippets illustrate the algorithms.

private final int cutoff = 8;

public <T> void perform(Comparable<T>[] a) {
    perform(a,0,a.length - 1);
}

private <T> int median3(Comparable<T>[] a,int x,int y,int z) {
    if(lessThan(a[x],a[y])) {
        if(lessThan(a[y],a[z])) {
            return y;
        } else if(lessThan(a[x],a[z])) {
            return z;
        } else {
            return x;
        }
    } else {
        if(lessThan(a[z],a[y])){
            return y;
        } else if(lessThan(a[z],a[x])) {
            return z;
        } else {
            return x;
        }
    }
}

private <T> void perform(Comparable<T>[] a,int low,int high) {
    int n = high - low + 1;
    if(n <= cutoff) {
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a,low,high);
    } else if(n <= 100) {
        int m = median3(a,low,low + (n >>> 1),high);
        exchange(a,m,low);
    } else {
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a,low,low + gap,low + (gap << 1));
        int m2 = median3(a,m - gap,m,m + gap);
        int m3 = median3(a,high - (gap << 1),high - gap,high);
        int ninther = median3(a,m1,m2,m3);
        exchange(a,ninther,low);
    }
    if(high <= low) return;
    int lt = low;
    int gt = high;
    Comparable<T> pivot = a[low];
    int i = low + 1;
    while(i <= gt) {
        if(lessThan(a[i],pivot)) {
            exchange(a,lt++,i++);
        } else if(lessThan(pivot,a[i])) {
            exchange(a,i,gt--);
        } else {
            i++;
        }
    }
    perform(a,low,lt - 1);
    perform(a,gt + 1,high);
}

Running the Unix sort -n bigdata -o bigdata.sorted command completes in about 24 minutes, but the author notes high I/O, memory pressure, and CPU time spent on context switches.

A bitmap‑based sorting method is then described, using a BitSet to record the presence of each integer, reading the large file into memory, setting bits, and then writing out the sorted numbers in ascending or descending order. This approach finishes in roughly 190 seconds on a machine with 4.6 GB/32 GB memory.

private BitSet bits;

public void perform(String largeFileName,int total,String destLargeFileName,Castor<Integer> castor,int readerBufferSize,int writerBufferSize,boolean asc) throws IOException {
    System.out.println("BitmapSort Started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn = PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut = PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();
    Integer data;
    int off = 0;
    while(true) {
        data = largeIn.read();
        if(data == null) break;
        int v = data;
        set(v);
        off++;
    }
    largeIn.close();
    int size = bits.size();
    System.out.println(String.format("lines : %d ,bits : %d", off, size));
    if(asc) {
        for(int i = 0; i < size; i++) {
            if(get(i)) {
                largeOut.write(i);
            }
        }
    } else {
        for(int i = size - 1; i >= 0; i--) {
            if(get(i)) {
                largeOut.write(i);
            }
        }
    }
    largeOut.close();
    long stop = System.currentTimeMillis();
    long elapsed = stop - start;
    System.out.println(String.format("BitmapSort Completed.elapsed : %dms",elapsed));
} 

private void set(int i) { bits.set(i); }
private boolean get(int v) { return bits.get(v); }

Finally, an external sorting strategy is detailed for environments with limited memory. The file is processed in chunks: each chunk is read into a small buffer, internally sorted, and written to temporary sorted part files. Afterwards, a multi‑way merge reads the smallest elements from each part file and writes them to the final sorted output. The complete external sort runs in about 13 minutes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

merge sortbitmap sortexternal sort
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.