Fundamentals 8 min read

Cracking the 4‑Billion QQ Deduplication Challenge with 1 GB Memory

This article walks through four approaches—sorting, hashmap, file splitting, and a bitmap technique—to deduplicate 4 billion QQ numbers within a 1 GB memory limit, explains why the first three fail, and shows how a bitmap solves the problem efficiently.

Liangxu Linux
Liangxu Linux
Liangxu Linux
Cracking the 4‑Billion QQ Deduplication Challenge with 1 GB Memory

Problem Statement

The interview question asks to deduplicate 4 billion QQ numbers stored in a file while using no more than 1 GB of RAM. The naive solutions either exceed time or memory limits, so a clever approach is required.

Method 1: Sorting

Sorting the entire list makes duplicate numbers adjacent, allowing a simple linear pass to keep the first occurrence and discard the rest. However, sorting 4 billion items requires far more than the allowed memory and time, making it unsuitable for the interview.

Method 2: HashMap

Inserting each QQ number as a key in a hashmap automatically removes duplicates. Example code:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = true

After the hashmap eliminates duplicates, the stored keys are:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = true

Although conceptually simple, a hashmap for 4 billion entries would require far more than 1 GB of memory, so this method also fails the memory constraint.

Method 3: File Splitting

Dividing the massive file into smaller chunks enables processing each chunk within memory limits, often combined with external merge‑sort or bucket‑sort. While this reduces memory usage, the multiple file I/O operations and additional sorting steps make the overall runtime too high for the interview scenario.

Method 4: Bitmap (Bit‑set) Optimization

The most effective solution uses a bitmap to represent the existence of each possible QQ number. Since QQ numbers are 32‑bit integers (max value 2³²‑1 ≈ 4.3 billion), a bitmap of 2³² bits requires 512 MB, well within the 1 GB limit.

Each bit corresponds to a specific QQ number; setting the bit to 1 marks the number as present. Example code:

bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[123] = 1
bitmapFlag[890] = 1

After processing the entire file, the bitmap automatically contains only unique numbers:

bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1

Scanning the bitmap from low to high indices and outputting indices with a set bit yields the deduplicated list. This approach satisfies both time and memory constraints and would pass the Tencent interview.

Extension Exercises

Exercise 1 – Sorting

Using the same bitmap, iterating over all possible QQ numbers in order naturally produces a sorted list, because the bitmap’s index order corresponds to numeric order.

Exercise 2 – Median

After building the bitmap, a single linear scan counting set bits can locate the median position without extra memory.

Exercise 3 – Top‑K

Similarly, scanning the bitmap from the highest index downward and collecting the first K set bits yields the top‑K QQ numbers.

Exercise 4 – Duplicate Detection for 8 Billion Numbers

With 8 billion entries, the pigeonhole principle guarantees duplicates because the total number of possible distinct QQ numbers (~4.3 billion) is smaller than the entry count. Thus, a simple bitmap scan can immediately confirm the existence of duplicates.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

algorithmBig DataMemory OptimizationBitmapdeduplication
Liangxu Linux
Written by

Liangxu Linux

Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.