Big Data 10 min read

Sorting a 4.6GB File of 500M Numbers: Internal, Merge, Bitmap & External Techniques

This article explores how to sort a massive 4.6 GB file containing 500 million random integers by applying internal quicksort with median‑of‑three, merge sort, a bitmap‑based approach, and an external‑merge strategy, comparing their performance, memory usage, and implementation details in Java.

Java Backend Technology
Java Backend Technology
Java Backend Technology
Sorting a 4.6GB File of 500M Numbers: Internal, Merge, Bitmap & External Techniques

1. Problem

You are given a file bigdata of size 4663 MB containing 500 million random integers, one per line, and need to sort it.

Problem illustration
Problem illustration

2. Internal Sorting

Two internal sorting strategies are tried: a quicksort variant with a cutoff to insertion sort and a three‑way partition (median‑of‑three) to improve pivot selection.

private final int cutoff = 8;

public void perform(Comparable[] a) {
    perform(a, 0, a.length - 1);
}

private void perform(Comparable[] a, int low, int high) {
    int n = high - low + 1;
    if (n <= cutoff) {
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a, low, high);
    } else if (n <= 100) {
        int m = median3(a, low, low + (n >>> 1), high);
        exchange(a, m, low);
    } else {
        int gap = n >>> 3;
        // further partitioning logic omitted for brevity
    }
    // recursive calls
    perform(a, low, lt - 1);
    perform(a, gt + 1, high);
}

3. Merge Sort

A classic top‑down merge sort is implemented, delegating to insertion sort for small sub‑arrays (cutoff = 8).

@Override
public void perform(Comparable[] a) {
    Comparable[] b = a.clone();
    perform(b, a, 0, a.length - 1);
}

private void perform(Comparable[] src, Comparable[] dest, int low, int high) {
    if (low >= high) return;
    if (high - low <= cutoff) {
        SortFactory.createInsertionSort().perform(dest, low, high);
        return;
    }
    int mid = low + ((high - low) >>> 1);
    perform(dest, src, low, mid);
    perform(dest, src, mid + 1, high);
    if (lessThanOrEqual(src[mid], src[mid + 1])) {
        System.arraycopy(src, low, dest, low, high - low + 1);
        return;
    }
    merge(src, dest, low, mid, high);
}

4. Using the Unix sort Command

Running the built‑in sort utility on the file took about 24 minutes, mainly due to heavy I/O, memory pressure, and frequent context switches.

5. Bitmap Method

A bitmap (BitSet) is used to record the presence of each integer, allowing linear‑time sorting when the value range is known.

private BitSet bits;

public void perform(String largeFileName, int total, String destLargeFileName,
                    int readerBufferSize, int writerBufferSize, boolean asc) throws IOException {
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn = PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut = PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();
    Integer data;
    int off = 0;
    while (true) {
        data = largeIn.read();
        if (data == null) break;
        set(data);
        off++;
    }
    largeIn.close();
    int size = bits.size();
    System.out.println(String.format("lines : %d ,bits : %d", off, size));
    if (asc) {
        for (int i = 0; i < size; i++) {
            if (get(i)) largeOut.write(i);
        }
    } else {
        for (int i = size - 1; i >= 0; i--) {
            if (get(i)) largeOut.write(i);
        }
    }
    largeOut.close();
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(String.format("BitmapSort Completed.elapsed : %dms", elapsed));
}

private void set(int i) { bits.set(i); }
private boolean get(int i) { return bits.get(i); }

The bitmap sort completed in 190 seconds (≈3 minutes) with most time spent on I/O.

6. External Sorting

When memory is insufficient, an external‑merge sort is applied: the large file is split into many sorted chunks that fit into a small in‑memory buffer, then these chunks are merged using a multi‑way merge.

External sort merge illustration
External sort merge illustration

The external sort finished in about 771 seconds (≈13 minutes).

Overall, the article demonstrates practical sorting techniques for massive datasets, evaluates their trade‑offs, and provides Java implementations for each method.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

algorithmSortingexternal sort
Java Backend Technology
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.