Big Data 8 min read

How to Deduplicate 4 Billion QQ Numbers Using Only 1 GB of Memory

This article walks through four practical techniques—sorting, hashmap, file splitting, and bitmap—to remove duplicate QQ numbers from a 4‑billion‑record file within a 1 GB memory limit, and provides extended exercises for sorting, median, top‑K, and duplicate detection.

NiuNiu MaTe
NiuNiu MaTe
NiuNiu MaTe
How to Deduplicate 4 Billion QQ Numbers Using Only 1 GB of Memory

Hello, I'm NiuNiu. Today we discuss a common interview problem that appears in Tencent's third‑round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them while keeping only one copy of each, with a memory limit of 1 GB.

Method 1: Sorting

The simplest idea is to sort all QQ numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.

Original QQ numbers:

Sorted QQ numbers:

Deduplicated result:

However, sorting's time complexity is too high to pass the interview.

Method 2: HashMap

Using a hashmap to record each QQ number eliminates duplicates automatically.

mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = true

After inserting, the hashmap contains only unique keys:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = true

But storing 4 billion entries exceeds the 1 GB memory limit.

Method 3: File Splitting

For massive data, one may split the file into smaller chunks and process each separately, using external merge sort or bucket sort. Yet this still incurs high I/O overhead and may not meet the interview constraints.

Method 4: Bitmap

Optimizing the hashmap idea, we can use a bitmap to represent the existence of each possible QQ number.

An unsigned char (8 bits) can represent numbers 0‑7; a value of 255 means all eight numbers exist, while 254 indicates numbers 1‑7 exist but 0 does not.

Similarly, an unsigned int (32 bits) can represent numbers 0‑31, and two unsigned ints can cover 0‑63.

Since a QQ number fits within 32 bits (max 2^32‑1 ≈ 4.3 billion), a bitmap of 512 MB is sufficient to mark the presence of all possible QQ numbers.

Implementation example:

bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1

Scanning the bitmap from low to high and outputting indices with value 1 yields a deduplicated, sorted list, which satisfies the interview requirements.

Extended Exercise 1

Design an algorithm to sort 4 billion distinct QQ numbers with 1 GB memory. Using the bitmap to mark existence automatically yields a sorted output when scanning the bitmap.

Extended Exercise 2

Find the median of 4 billion distinct QQ numbers under the same memory constraint. The bitmap approach again provides a direct way to locate the middle element.

Extended Exercise 3

Retrieve the top‑K QQ numbers from the 4 billion distinct entries using 1 GB memory. After bitmap construction, scanning from the highest index and collecting the first K hits solves the problem.

Extended Exercise 4

Given 8 billion QQ numbers, determine whether any duplicates exist with 1 GB memory. By the pigeonhole principle, because the total possible distinct QQ numbers are about 4.3 billion, duplicates must exist.

algorithmbig dataBitmapDeduplication
NiuNiu MaTe
Written by

NiuNiu MaTe

Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.