Cracking the 4‑Billion QQ Deduplication Challenge with 1 GB Memory
This article walks through four approaches—sorting, hashmap, file splitting, and a bitmap technique—to deduplicate 4 billion QQ numbers within a 1 GB memory limit, explains why the first three fail, and shows how a bitmap solves the problem efficiently.
Problem Statement
The interview question asks to deduplicate 4 billion QQ numbers stored in a file while using no more than 1 GB of RAM. The naive solutions either exceed time or memory limits, so a clever approach is required.
Method 1: Sorting
Sorting the entire list makes duplicate numbers adjacent, allowing a simple linear pass to keep the first occurrence and discard the rest. However, sorting 4 billion items requires far more than the allowed memory and time, making it unsuitable for the interview.
Method 2: HashMap
Inserting each QQ number as a key in a hashmap automatically removes duplicates. Example code:
mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = trueAfter the hashmap eliminates duplicates, the stored keys are:
mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = trueAlthough conceptually simple, a hashmap for 4 billion entries would require far more than 1 GB of memory, so this method also fails the memory constraint.
Method 3: File Splitting
Dividing the massive file into smaller chunks enables processing each chunk within memory limits, often combined with external merge‑sort or bucket‑sort. While this reduces memory usage, the multiple file I/O operations and additional sorting steps make the overall runtime too high for the interview scenario.
Method 4: Bitmap (Bit‑set) Optimization
The most effective solution uses a bitmap to represent the existence of each possible QQ number. Since QQ numbers are 32‑bit integers (max value 2³²‑1 ≈ 4.3 billion), a bitmap of 2³² bits requires 512 MB, well within the 1 GB limit.
Each bit corresponds to a specific QQ number; setting the bit to 1 marks the number as present. Example code:
bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[123] = 1
bitmapFlag[890] = 1After processing the entire file, the bitmap automatically contains only unique numbers:
bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1Scanning the bitmap from low to high indices and outputting indices with a set bit yields the deduplicated list. This approach satisfies both time and memory constraints and would pass the Tencent interview.
Extension Exercises
Exercise 1 – Sorting
Using the same bitmap, iterating over all possible QQ numbers in order naturally produces a sorted list, because the bitmap’s index order corresponds to numeric order.
Exercise 2 – Median
After building the bitmap, a single linear scan counting set bits can locate the median position without extra memory.
Exercise 3 – Top‑K
Similarly, scanning the bitmap from the highest index downward and collecting the first K set bits yields the top‑K QQ numbers.
Exercise 4 – Duplicate Detection for 8 Billion Numbers
With 8 billion entries, the pigeonhole principle guarantees duplicates because the total number of possible distinct QQ numbers (~4.3 billion) is smaller than the entry count. Thus, a simple bitmap scan can immediately confirm the existence of duplicates.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
