How to De‑duplicate 4 Billion QQ Numbers with 1 GB Memory: 4 Proven Techniques
This article explains four practical methods—sorting, hash map, file splitting, and bitmap—to deduplicate 4 billion QQ numbers within a 1 GB memory limit, and provides extended exercises on sorting, finding the median, top‑K, and duplicate detection for massive datasets.
Today we discuss a common interview question that appears in Tencent's third‑round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them with only 1 GB of memory.
Method 1: Sorting
The simplest approach is to sort all numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.
Original numbers (illustrated below):
Sorted numbers:
After sorting, deduplication is straightforward.
However, sorting the 4 billion entries would be too time‑consuming for the interview.
Method 2: HashMap
Use a hash map to record each QQ number; the map’s uniqueness property automatically removes duplicates.
mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = trueAfter inserting all numbers, the map contains only unique keys:
mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = trueBut storing 4 billion entries in a hash map exceeds the 1 GB memory limit.
Method 3: File Splitting
Divide the massive file into smaller chunks, sort each chunk (or use bucket sort), and then merge the results. This reduces memory usage but still incurs high I/O overhead and may not meet the interview’s efficiency expectations.
Method 4: Bitmap
Use a bitmap (bitset) to represent the existence of each possible QQ number (range 0‑2³²‑1). An unsigned char stores 8 bits, an unsigned int stores 32 bits, etc. With 512 MB we can represent all 4 billion possible numbers.
Set the bit corresponding to each QQ number to 1; after processing the file, scanning the bitmap from low to high yields the deduplicated (and sorted) list.
bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1This method satisfies both time and space constraints and would pass the Tencent interview.
Extended Exercise 1
Given 4 billion distinct QQ numbers, design an algorithm to sort them with 1 GB memory. Using the bitmap approach automatically provides a sorted order.
Extended Exercise 2
Find the median of 4 billion distinct QQ numbers under the same memory limit. The bitmap can be scanned until the middle count is reached.
Extended Exercise 3
Find the top‑K QQ numbers. Again, the bitmap representation allows direct extraction of the largest K values.
Extended Exercise 4
Given 8 billion QQ numbers, determine whether any duplicates exist with 1 GB memory. Since the total possible distinct numbers are about 4.3 billion, the pigeonhole principle guarantees a duplicate.
In summary, choosing the right data structure—especially a bitmap—enables efficient deduplication, sorting, and related queries on massive datasets within strict memory limits.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
