Big Data 9 min read

How to Deduplicate 4 Billion QQ IDs Using a Bitmap Within 1 GB Memory

Learn how to efficiently remove duplicates from 4 billion QQ numbers using a memory‑friendly Bitmap approach that fits within a 1 GB limit, including calculations, step‑by‑step implementation, Java code, and a discussion of its advantages and drawbacks.

ITPUB
ITPUB
ITPUB
How to Deduplicate 4 Billion QQ IDs Using a Bitmap Within 1 GB Memory

Introduction

This article tackles a classic massive‑data deduplication problem: removing duplicates from 4 billion QQ numbers while staying under a strict 1 GB memory budget. It explores why a naïve HashSet solution fails and proposes a Bitmap‑based method.

Conventional Approach and Memory Issue

Developers often reach for HashSet to achieve deduplication:

Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatically dedupes

However, storing 4 billion 64‑bit integers would require roughly 30 GB of RAM (or 15 GB if using 32‑bit integers), far exceeding the 1 GB limit.

Bitmap Data Structure

What is a Bitmap?

A Bitmap (位图) is a highly efficient data structure that uses a single bit to indicate the presence of a value. For a range of numbers, each possible value maps to one bit.

Example for a 10‑bit Bitmap representing numbers 0‑9:

If the 0th bit is 1, number 0 exists.

If the 1st bit is 1, number 1 exists.

If the 2nd bit is 1, number 2 exists.

Illustration:

Bitmap example for numbers 0‑9
Bitmap example for numbers 0‑9

For the QQ numbers 9, 5, 2, 7 the Bitmap would look like:

Bitmap representation of QQ numbers 9,5,2,7
Bitmap representation of QQ numbers 9,5,2,7

Memory Estimation for 4 Billion IDs

A Bitmap needs one bit per possible ID. For 4 billion IDs:

4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GB

Thus the Bitmap fits comfortably within the 1 GB constraint.

Deduplication Process Using Bitmap

Initialize a Bitmap with 4 billion bits (≈0.5 GB).

Iterate over each QQ number, map it to the corresponding bit position, and set that bit to 1.

After processing, scan the Bitmap and collect all indices whose bits are 1; these indices are the unique QQ numbers.

Example: For QQ number 326443281, set the bit at position 326443281 to 1.

Java Implementation

import java.util.*;

public class QQDeduplication {
    // Bitmap size = 2^32 bits ≈ 0.5 GB
    private static final long BITMAP_SIZE = 1L << 32; // 2^32
    private static final int BYTE_SIZE = 8;

    private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
        byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
        for (long qqNumber : qqNumbers) {
            if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
                int index = (int) (qqNumber / BYTE_SIZE);
                int bitPos = (int) (qqNumber % BYTE_SIZE);
                bitmap[index] |= (1 << bitPos);
            }
        }
        List<Long> unique = new ArrayList<>();
        for (int i = 0; i < bitmap.length; i++) {
            for (int j = 0; j < BYTE_SIZE; j++) {
                if ((bitmap[i] & (1 << j)) != 0) {
                    long qq = (long) i * BYTE_SIZE + j;
                    unique.add(qq);
                }
            }
        }
        return unique;
    }
}

Advantages and Disadvantages of Bitmap

Advantages

High space efficiency: only 1 bit per element.

O(1) insertion and query using bit operations.

Simplified deduplication logic—just set and check bits.

Disadvantages

Memory usage depends on the value range; sparse large ranges waste space (e.g., a trillion‑scale ID space would need >125 GB).

Cannot store additional metadata such as occurrence counts—only presence/absence.

Conclusion

Using a Bitmap allows deduplication of 4 billion QQ numbers within a sub‑gigabyte memory footprint, offering fast O(1) operations and dramatic storage savings compared to hash‑based structures. For scenarios with extremely sparse ID spaces, alternative techniques like Bloom filters may be considered.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

javaBig DataMemory OptimizationBitmapdeduplicationData Structures
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.