How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM
This article explains several algorithmic strategies—including sorting, hash maps, file splitting, and bitmap techniques—to remove duplicates from a file containing 4 billion QQ numbers while staying within a 1 GB memory limit, and it provides extension exercises for sorting, median, top‑K, and duplicate detection.
In a Tencent interview scenario, you may be asked to deduplicate 4 billion QQ numbers using only 1 GB of memory. The following methods illustrate how to solve this massive‑data problem.
Method One: Sorting
The most straightforward approach is to sort all QQ numbers; duplicate entries become adjacent, so keeping the first and discarding the rest yields a deduplicated list. However, sorting 4 billion items exceeds the time limits of the interview.
After sorting, duplicates can be removed easily, but the algorithm’s O(n log n) time complexity is too high for the interview constraints.
Method Two: HashMap
Using a hash map to record each QQ number’s presence reduces the problem to O(n) time. Each number is inserted as a key with a boolean flag:
mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = trueBecause a hash map stores only unique keys, the resulting structure contains:
mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = trueWhile this method achieves deduplication, allocating a hash map for 4 billion entries far exceeds the 1 GB memory limit.
Method Three: File Splitting
Dividing the massive file into smaller chunks and applying external merge sort or bucket sort can avoid memory overflow. Nevertheless, the required sorting step still incurs high time cost, making it unsuitable for the interview.
Method Four: Bitmap
A bitmap (bit array) offers a space‑efficient solution. An unsigned char (8 bits) can represent the existence of numbers 0–7; an unsigned int (32 bits) covers 0–31, and two unsigned ints cover 0–63. Since the theoretical maximum QQ number is 2³²‑1 (~4.3 billion), a bitmap of 512 MB is sufficient to mark every possible QQ number.
Setting a bit to 1 indicates the corresponding QQ number exists:
bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1Scanning the bitmap from low to high indices and outputting numbers whose bit is 1 automatically yields a deduplicated, sorted list, satisfying both time and memory constraints.
Extended Exercise 1: Sorting 4 Billion Distinct QQ Numbers
Using the bitmap to mark presence also sorts the numbers: after marking, a linear scan outputs them in ascending order.
Extended Exercise 2: Finding the Median
Because the bitmap provides a sorted order, counting bits until reaching the middle index yields the median without extra memory.
Extended Exercise 3: Top‑K Query
Scanning the bitmap from the highest index downward and collecting the first K set bits returns the top‑K QQ numbers.
Extended Exercise 4: Detecting Duplicates in 8 Billion Numbers
Since the total number of possible QQ values (~4.3 billion) is less than 8 billion, the pigeonhole principle guarantees at least one duplicate; a bitmap can confirm this quickly.
These techniques illustrate how to handle massive data deduplication, sorting, and related queries within strict memory limits.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
