Interview Problem: Finding the Most Frequent Number in Billions of Integers with Limited Memory
The article discusses an interview scenario where a candidate must determine the most frequent integer among billions of values using only a few gigabytes of memory, exploring hash tables, file partitioning, and hashing strategies to handle massive data sets efficiently.
Recently, Xiao Qiu attended an interview after studying bitwise algorithm articles.
The interviewer presented a challenge: given 2 GB of memory and 2 billion 32‑bit integers, find the number that occurs most frequently.
Xiao Qiu initially suggested using a hash table where each integer is a key and its occurrence count is the value, then scanning the table for the maximum count.
When asked about memory usage, Xiao calculated that each hash entry requires 8 bytes (4 bytes for key + 4 bytes for value). In the worst case of 2 billion distinct numbers, the hash table would need about 16 GB, exceeding the 2 GB limit. The interviewer confirmed the memory constraint and hinted that the current approach would only store roughly 200 million distinct entries (about 1.6 GB).
Xiao then proposed splitting the 2 billion numbers into multiple files based on value ranges (e.g., 0‑200 million in file 1, 200 million‑400 million in file 2, etc.), resulting in about 21 files covering the full 32‑bit integer space.
Since identical numbers fall into the same file, each file can be processed independently using the original hash‑count method, and the global maximum can be selected from the per‑file results.
The interviewer asked how to handle a case where the numbers are densely packed (e.g., all between 1 and 20 million), which would map many values to a single file. Xiao suggested applying a good hash function before writing to files to distribute the numbers more evenly.
40‑billion‑scale extension
When the dataset is increased to 4 billion numbers, Xiao simply increased the number of files.
If all 4 billion numbers are identical, the hash table’s value field (int) would overflow (max ≈ 2.1 billion). Xiao’s workaround was to use a long counter or initialize the value to a negative sentinel (‑2.1 billion) to detect overflow.
80‑billion‑scale extension
For 8 billion numbers, Xiao proposed a one‑pass early‑exit strategy: while counting, if a key’s count exceeds 4 billion, no other key can surpass it, so the algorithm can return that key immediately.
Summary
The article illustrates several techniques for processing massive integer datasets under strict memory limits, including hash tables, file partitioning, hash‑based distribution, and early‑exit counting, and invites readers to share alternative or more optimal solutions.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
