Finding the Median of 10 GB of Integers with a 2 GB Memory Limit
The article outlines a multi‑pass bucket‑sorting strategy to locate the median of a 10 GB unordered integer file using only 2 GB of RAM, detailing how to partition data into byte‑wise buckets, compute the median’s bucket, and finalize the result with in‑memory sorting.
Problem: A file contains 10 GB of unordered integers and the task is to find the median using only 2 GB of RAM; only the algorithmic idea is required.
Median definition: after sorting, the median is the middle value; for an even number of samples it is the average of the N/2‑th and (N/2+1)‑th elements, so for 10 GB of numbers the median lies between the 5 GB‑th and (5 GB+1)‑th largest values.
Analysis: The simplest approach is sorting, and a byte‑wise bucket sort is feasible.
Idea: Treat each byte of a 32‑bit integer as a key (four keys per integer). Higher‑order bytes determine larger numbers, and comparison follows lexical order.
Step 1: Read 2 GB of the data into memory at a time, iterating over the 536,870,912 integers. Extract the most significant 8 bits (bits 31‑24) using a right‑shift operation. These bits (0‑255) define up to 256 buckets; numbers are written to corresponding disk files while maintaining a counter for each bucket (only 256 integers needed in memory).
Cost: Unavoidable I/O to read the 10 GB, O(n) linear traversal in memory, and an additional pass to write the 256 buckets back to disk, roughly doubling the data transfer.
Step 2: Using the bucket counts, determine which bucket contains the median. For 2,684,354,560 numbers the median position is 1,342,177,280. If the cumulative count of the first 127 buckets is below this position and adding bucket 128 exceeds it, the median resides in bucket 128. Its exact offset is the median position minus the sum of the first 127 buckets. Read bucket 128 (approximately 80 MB, rarely exceeding 2 GB) into memory.
Cost: O(M) accumulation of bucket counts (M < 255) and an I/O read of the ~80 MB file.
Note: In pathological cases where bucket 128 exceeds 2 GB, it can be processed in batches as in Step 1.
Step 3: Perform another bucket sort on the next byte (bits 23‑16) within memory, again using 256 buckets.
Step 4: Continue the process down to the least‑significant byte (bits 7‑0). At that point the remaining data can be sorted in memory with a quicksort.
The overall algorithm runs in linear O(n) time without nested loops; the dominant cost is the second memory‑disk data exchange in Step 1. After Step 2, if the median‑containing bucket fits into memory, a simple quicksort completes the task.
Welcome to like, collect, and share the post.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
