Big Data 9 min read

How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM

Learn four practical techniques—simple sorting, hashmap deduplication, external merge sort, and bitmap bit‑set optimization—to efficiently remove duplicate QQ numbers from a 40‑billion‑record file while staying within a strict 1 GB memory limit, even handling tighter 100 MB constraints.

NiuNiu MaTe
NiuNiu MaTe
NiuNiu MaTe
How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM

Hello, I'm Niu Ge. Today we discuss a common interview problem that also appears in Tencent's third-round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them, keeping only one copy of each, under a 1 GB memory limit.

Method 1: Sorting

The simplest approach is to sort all QQ numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.

Original QQ numbers:

Sorted QQ numbers:

After removing adjacent duplicates the result is:

However, interviewers often ask for a faster solution than plain sorting.

Method 2: HashMap

Using a hashmap, we record each QQ number as a key; the hashmap’s inherent deduplication leaves only unique entries.

mapFlag[123] = true</code><code>mapFlag[567] = true</code><code>mapFlag[123] = true</code><code>mapFlag[890] = true

After inserting all numbers, the hashmap contains:

mapFlag[123] = true</code><code>mapFlag[567] = true</code><code>mapFlag[890] = true

This yields the deduplicated set {123, 567, 890}. Yet a hashmap for 4 billion entries exceeds the 1 GB memory limit.

Method 3: External Sorting

External sorting handles data far larger than memory. Using the Unix command sort -un qq.txt (‑u for unique, ‑n for numeric), we perform a divide‑and‑conquer multi‑way merge.

Steps:

Split the 40 billion‑record file into 100 MB chunks (≈320 files). Sort each chunk individually.

Merge the sorted chunks using a min‑heap that holds the smallest element from each file.

During the final merge, write out each number only if it differs from the previously written one, achieving deduplication.

While correct, this approach involves many file operations and may still be considered inefficient in an interview.

Method 4: Bitmap

A bitmap (bit‑set) can represent the existence of each possible QQ number using a single bit. Since QQ numbers fit within a 32‑bit range (≈4.3 billion), a bitmap of 2^32 bits requires 512 MB, which fits within the 1 GB limit.

Illustration of an unsigned char (8 bits) representing numbers 0‑7:

Similarly, an unsigned int (32 bits) can represent numbers 0‑31, and two unsigned ints can cover 0‑63.

Using a bitmap array:

bitmapFlag[123] = 1</code><code>bitmapFlag[567] = 1</code><code>bitmapFlag[890] = 1

Scanning the bitmap from low to high yields all existing QQ numbers, automatically deduplicated. This method comfortably passes the interview constraints.

When Memory Is Only 100 MB

If the available memory drops to 100 MB, a full bitmap no longer fits. We apply divide‑and‑conquer: use the highest 3 bits of each QQ number to partition the data into 8 groups (each ≈5 hundred million numbers, requiring ≈62.5 MB bitmap). Process each group separately with a bitmap, then concatenate the results.

This demonstrates that massive‑data problems require tailored solutions rather than a one‑size‑fits‑all approach.

Practice these techniques to improve your problem‑solving skills for technical interviews.

algorithmBig DataBitmapDeduplicationexternal sort
NiuNiu MaTe
Written by

NiuNiu MaTe

Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.