Big Data 8 min read

Sorting a 4.6 GB File with 500 Million Integers: Internal, Bitmap, and External Sorting Techniques

The article explains how to sort a massive 4.6 GB file containing 500 million random integers by first attempting in‑memory quicksort and merge sort, then using a bitmap approach, and finally applying an external sort that splits the data into manageable chunks and merges them efficiently.

Architect's Tech Stack
Architect's Tech Stack
Architect's Tech Stack
Sorting a 4.6 GB File with 500 Million Integers: Internal, Bitmap, and External Sorting Techniques

During an interview I was asked to sort a 4.6 GB file containing 500 million random integers, one per line.

First I tried internal sorting using a three‑way quicksort implementation in Java, with a cutoff to insertion sort for small sub‑arrays, and also a classic merge sort.

private final int cutoff = 8;
public <T> void perform(Comparable<T>[] a) {
    // quicksort implementation with cutoff
    // ...
}

These approaches quickly ran into stack‑overflow or out‑of‑memory problems because the whole file cannot be loaded into heap memory.

Next I implemented a bitmap sort that uses a BitSet to record the presence of each integer, reads the file sequentially, sets the corresponding bit, and finally writes the numbers back in order.

private BitSet bits;
public void perform(String largeFileName, int total, String destLargeFileName,
                    int readerBufferSize, int writerBufferSize, boolean asc) throws IOException {
    bits = new BitSet(total);
    // read each integer and set the bit
    Integer data;
    while ((data = largeIn.read()) != null) {
        bits.set(data);
    }
    // write out in order
    for (int i = 0; i < bits.size(); i++) {
        if (bits.get(i)) {
            largeOut.write(i);
        }
    }
}

The bitmap method finished in about 190 seconds (≈3 minutes) but still required a large amount of memory for the bit set.

Finally I applied external sorting: the file is split into many small sorted chunks that fit into a limited memory buffer, each chunk is sorted with the internal sorter and written to disk; then a multi‑way merge reads the heads of all chunks, repeatedly outputs the smallest current value, and refills the buffer from the corresponding chunk.

// Pseudo‑code for the merge phase
while (anyChunkHasData) {
    int min = min(currentHeadValues);
    output.write(min);
    advanceChunkContaining(min);
}

This external sort completed in about 771 seconds (≈13 minutes) and handled the full 5 billion‑number file without exceeding memory limits.

All code examples are written in Java and illustrate the trade‑offs between in‑memory, bitmap, and external sorting techniques for massive datasets.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

javaalgorithmBig DataBitmapSortingexternal sort
Architect's Tech Stack
Written by

Architect's Tech Stack

Java backend, microservices, distributed systems, containerized programming, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.