Big Data 8 min read

How to De‑duplicate 4 Billion QQ Numbers with 1 GB Memory: 4 Proven Techniques

This article explains four practical methods—sorting, hash map, file splitting, and bitmap—to deduplicate 4 billion QQ numbers within a 1 GB memory limit, and provides extended exercises on sorting, finding the median, top‑K, and duplicate detection for massive datasets.

Java Backend Technology
Java Backend Technology
Java Backend Technology
How to De‑duplicate 4 Billion QQ Numbers with 1 GB Memory: 4 Proven Techniques

Today we discuss a common interview question that appears in Tencent's third‑round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them with only 1 GB of memory.

Method 1: Sorting

The simplest approach is to sort all numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.

Original numbers (illustrated below):

Sorted numbers:

After sorting, deduplication is straightforward.

However, sorting the 4 billion entries would be too time‑consuming for the interview.

Method 2: HashMap

Use a hash map to record each QQ number; the map’s uniqueness property automatically removes duplicates.

mapFlag[123] = true
mapFlag[567] = true
mapFlag[123] = true
mapFlag[890] = true

After inserting all numbers, the map contains only unique keys:

mapFlag[123] = true
mapFlag[567] = true
mapFlag[890] = true

But storing 4 billion entries in a hash map exceeds the 1 GB memory limit.

Method 3: File Splitting

Divide the massive file into smaller chunks, sort each chunk (or use bucket sort), and then merge the results. This reduces memory usage but still incurs high I/O overhead and may not meet the interview’s efficiency expectations.

Method 4: Bitmap

Use a bitmap (bitset) to represent the existence of each possible QQ number (range 0‑2³²‑1). An unsigned char stores 8 bits, an unsigned int stores 32 bits, etc. With 512 MB we can represent all 4 billion possible numbers.

Set the bit corresponding to each QQ number to 1; after processing the file, scanning the bitmap from low to high yields the deduplicated (and sorted) list.

bitmapFlag[123] = 1
bitmapFlag[567] = 1
bitmapFlag[890] = 1

This method satisfies both time and space constraints and would pass the Tencent interview.

Extended Exercise 1

Given 4 billion distinct QQ numbers, design an algorithm to sort them with 1 GB memory. Using the bitmap approach automatically provides a sorted order.

Extended Exercise 2

Find the median of 4 billion distinct QQ numbers under the same memory limit. The bitmap can be scanned until the middle count is reached.

Extended Exercise 3

Find the top‑K QQ numbers. Again, the bitmap representation allows direct extraction of the largest K values.

Extended Exercise 4

Given 8 billion QQ numbers, determine whether any duplicates exist with 1 GB memory. Since the total possible distinct numbers are about 4.3 billion, the pigeonhole principle guarantees a duplicate.

In summary, choosing the right data structure—especially a bitmap—enables efficient deduplication, sorting, and related queries on massive datasets within strict memory limits.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

algorithmBitmapdeduplicationinterviewlarge-scale data
Java Backend Technology
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.