Big Data 8 min read

Sorting a 4.6 GB File with 500 Million Integers: Internal, Bitmap, and External Sorting Techniques

The article explains how to sort a massive 4.6 GB file containing 500 million random integers by first attempting in‑memory quicksort and merge sort, then using a bitmap approach, and finally applying an external sort that splits the data into manageable chunks and merges them efficiently.

Architect's Tech Stack
Architect's Tech Stack
Architect's Tech Stack
Sorting a 4.6 GB File with 500 Million Integers: Internal, Bitmap, and External Sorting Techniques

During an interview I was asked to sort a 4.6 GB file containing 500 million random integers, one per line.

First I tried internal sorting using a three‑way quicksort implementation in Java, with a cutoff to insertion sort for small sub‑arrays, and also a classic merge sort.

private final int cutoff = 8;
public
void perform(Comparable
[] a) {
    // quicksort implementation with cutoff
    // ...
}

These approaches quickly ran into stack‑overflow or out‑of‑memory problems because the whole file cannot be loaded into heap memory.

Next I implemented a bitmap sort that uses a BitSet to record the presence of each integer, reads the file sequentially, sets the corresponding bit, and finally writes the numbers back in order.

private BitSet bits;
public void perform(String largeFileName, int total, String destLargeFileName,
                    int readerBufferSize, int writerBufferSize, boolean asc) throws IOException {
    bits = new BitSet(total);
    // read each integer and set the bit
    Integer data;
    while ((data = largeIn.read()) != null) {
        bits.set(data);
    }
    // write out in order
    for (int i = 0; i < bits.size(); i++) {
        if (bits.get(i)) {
            largeOut.write(i);
        }
    }
}

The bitmap method finished in about 190 seconds (≈3 minutes) but still required a large amount of memory for the bit set.

Finally I applied external sorting: the file is split into many small sorted chunks that fit into a limited memory buffer, each chunk is sorted with the internal sorter and written to disk; then a multi‑way merge reads the heads of all chunks, repeatedly outputs the smallest current value, and refills the buffer from the corresponding chunk.

// Pseudo‑code for the merge phase
while (anyChunkHasData) {
    int min = min(currentHeadValues);
    output.write(min);
    advanceChunkContaining(min);
}

This external sort completed in about 771 seconds (≈13 minutes) and handled the full 5 billion‑number file without exceeding memory limits.

All code examples are written in Java and illustrate the trade‑offs between in‑memory, bitmap, and external sorting techniques for massive datasets.

Javaalgorithmbig databitmapsortingexternal sort
Architect's Tech Stack
Written by

Architect's Tech Stack

Java backend, microservices, distributed systems, containerized programming, and more.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.