Big Data 8 min read

How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM

This article explains several algorithmic strategies—including sorting, hash maps, file splitting, and bitmap techniques—to remove duplicates from a file containing 4 billion QQ numbers while staying within a 1 GB memory limit, and it provides extension exercises for sorting, median, top‑K, and duplicate detection.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM

In a Tencent interview scenario, you may be asked to deduplicate 4 billion QQ numbers using only 1 GB of memory. The following methods illustrate how to solve this massive‑data problem.

Method One: Sorting

The most straightforward approach is to sort all QQ numbers; duplicate entries become adjacent, so keeping the first and discarding the rest yields a deduplicated list. However, sorting 4 billion items exceeds the time limits of the interview.

After sorting, duplicates can be removed easily, but the algorithm’s O(n log n) time complexity is too high for the interview constraints.

Method Two: HashMap

Using a hash map to record each QQ number’s presence reduces the problem to O(n) time. Each number is inserted as a key with a boolean flag:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = true

Because a hash map stores only unique keys, the resulting structure contains:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = true

While this method achieves deduplication, allocating a hash map for 4 billion entries far exceeds the 1 GB memory limit.

Method Three: File Splitting

Dividing the massive file into smaller chunks and applying external merge sort or bucket sort can avoid memory overflow. Nevertheless, the required sorting step still incurs high time cost, making it unsuitable for the interview.

Method Four: Bitmap

A bitmap (bit array) offers a space‑efficient solution. An unsigned char (8 bits) can represent the existence of numbers 0–7; an unsigned int (32 bits) covers 0–31, and two unsigned ints cover 0–63. Since the theoretical maximum QQ number is 2³²‑1 (~4.3 billion), a bitmap of 512 MB is sufficient to mark every possible QQ number.

Setting a bit to 1 indicates the corresponding QQ number exists:

bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1

Scanning the bitmap from low to high indices and outputting numbers whose bit is 1 automatically yields a deduplicated, sorted list, satisfying both time and memory constraints.

Extended Exercise 1: Sorting 4 Billion Distinct QQ Numbers

Using the bitmap to mark presence also sorts the numbers: after marking, a linear scan outputs them in ascending order.

Extended Exercise 2: Finding the Median

Because the bitmap provides a sorted order, counting bits until reaching the middle index yields the median without extra memory.

Extended Exercise 3: Top‑K Query

Scanning the bitmap from the highest index downward and collecting the first K set bits returns the top‑K QQ numbers.

Extended Exercise 4: Detecting Duplicates in 8 Billion Numbers

Since the total number of possible QQ values (~4.3 billion) is less than 8 billion, the pigeonhole principle guarantees at least one duplicate; a bitmap can confirm this quickly.

These techniques illustrate how to handle massive data deduplication, sorting, and related queries within strict memory limits.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

algorithmBig DataMemory OptimizationBitmapdeduplication
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.