How to Deduplicate 4 Billion QQ Numbers Using Only 1 GB of Memory
This article walks through four practical techniques—sorting, hashmap, file splitting, and bitmap—to remove duplicate QQ numbers from a 4‑billion‑record file within a 1 GB memory limit, and provides extended exercises for sorting, median, top‑K, and duplicate detection.
Hello, I'm NiuNiu. Today we discuss a common interview problem that appears in Tencent's third‑round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them while keeping only one copy of each, with a memory limit of 1 GB.
Method 1: Sorting
The simplest idea is to sort all QQ numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.
Original QQ numbers:
Sorted QQ numbers:
Deduplicated result:
However, sorting's time complexity is too high to pass the interview.
Method 2: HashMap
Using a hashmap to record each QQ number eliminates duplicates automatically.
mapFlag[123] = true mapFlag[567] = true mapFlag[123] = true mapFlag[890] = trueAfter inserting, the hashmap contains only unique keys:
mapFlag[123] = true mapFlag[567] = true mapFlag[890] = trueBut storing 4 billion entries exceeds the 1 GB memory limit.
Method 3: File Splitting
For massive data, one may split the file into smaller chunks and process each separately, using external merge sort or bucket sort. Yet this still incurs high I/O overhead and may not meet the interview constraints.
Method 4: Bitmap
Optimizing the hashmap idea, we can use a bitmap to represent the existence of each possible QQ number.
An unsigned char (8 bits) can represent numbers 0‑7; a value of 255 means all eight numbers exist, while 254 indicates numbers 1‑7 exist but 0 does not.
Similarly, an unsigned int (32 bits) can represent numbers 0‑31, and two unsigned ints can cover 0‑63.
Since a QQ number fits within 32 bits (max 2^32‑1 ≈ 4.3 billion), a bitmap of 512 MB is sufficient to mark the presence of all possible QQ numbers.
Implementation example:
bitmapFlag[123] = 1 bitmapFlag[567] = 1 bitmapFlag[890] = 1Scanning the bitmap from low to high and outputting indices with value 1 yields a deduplicated, sorted list, which satisfies the interview requirements.
Extended Exercise 1
Design an algorithm to sort 4 billion distinct QQ numbers with 1 GB memory. Using the bitmap to mark existence automatically yields a sorted output when scanning the bitmap.
Extended Exercise 2
Find the median of 4 billion distinct QQ numbers under the same memory constraint. The bitmap approach again provides a direct way to locate the middle element.
Extended Exercise 3
Retrieve the top‑K QQ numbers from the 4 billion distinct entries using 1 GB memory. After bitmap construction, scanning from the highest index and collecting the first K hits solves the problem.
Extended Exercise 4
Given 8 billion QQ numbers, determine whether any duplicates exist with 1 GB memory. By the pigeonhole principle, because the total possible distinct QQ numbers are about 4.3 billion, duplicates must exist.
NiuNiu MaTe
Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
