Big Data 9 min read

How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB

This article explains how to efficiently remove duplicates from 4 billion QQ numbers within a 1 GB memory limit by replacing the naïve HashSet approach with a memory‑saving Bitmap data structure, complete with calculations, algorithm steps, Java code, and a discussion of its pros and cons.

macrozheng
macrozheng
macrozheng
How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB

1. Conventional Approach

In everyday development, the first thought for deduplication is to use a

HashSet

, which automatically removes duplicates:

<code>Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatic deduplication</code>

However, with a 1 GB memory limit, storing 4 billion QQ numbers in a

HashSet

is infeasible. Assuming each QQ number occupies 8 bytes (64‑bit integer), the total memory required is about 30 GB, far exceeding the limit.

2. Bitmap Solution

2.1 What is a Bitmap?

A Bitmap is a highly efficient data structure that uses a single bit to indicate the presence of a number, making it ideal for large‑scale deduplication and query problems.

For example, a Bitmap of length 10 can represent numbers 0‑9, where each bit set to 1 means the corresponding number exists.

If the 0th bit is 1, number 0 exists.

If the 1st bit is 1, number 1 exists.

If the 2nd bit is 1, number 2 exists.

…and so on.

Images illustrate the bitmap representation:

When recording QQ numbers 9, 5, 2, 7, the bitmap looks like this:

Using a traditional 4‑byte integer for each number would require 16 bytes for four numbers, whereas the bitmap needs only 10 bits, demonstrating significant space savings.

2.2 Applying Bitmap to 4 Billion QQ Numbers

2.2.1 Memory Estimate

Creating a bitmap with 4 billion bits requires:

<code>4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GB</code>

This fits comfortably within the 1 GB limit.

2.2.2 Deduplication Process

Initialize a bitmap with 4 billion bits.

Iterate over each QQ number, map it to a bit position, and set that bit to 1.

For example, QQ number 326443281 sets the corresponding bit to 1.

Traverse the bitmap and collect all indices whose bits are 1; these indices correspond to the unique QQ numbers.

2.3 Simple Java Implementation

The following code demonstrates bitmap‑based deduplication:

<code>import java.util.*;

public class QQDeduplication {
    // Bitmap size: 2^32 bits ≈ 0.5 GB
    private static final long BITMAP_SIZE = 1L << 32;
    private static final int BYTE_SIZE = 8; // bits per byte

    private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
        byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
        for (long qqNumber : qqNumbers) {
            if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
                int index = (int) (qqNumber / BYTE_SIZE);
                int bitPosition = (int) (qqNumber % BYTE_SIZE);
                bitmap[index] |= (1 << bitPosition);
            }
        }
        List<Long> uniqueQQNumbers = new ArrayList<>();
        for (int i = 0; i < bitmap.length; i++) {
            for (int j = 0; j < BYTE_SIZE; j++) {
                if ((bitmap[i] & (1 << j)) != 0) {
                    long qqNumber = (long) i * BYTE_SIZE + j;
                    uniqueQQNumbers.add(qqNumber);
                }
            }
        }
        return uniqueQQNumbers;
    }
}
</code>

2.4 Advantages and Disadvantages of Bitmap

Advantages

High space efficiency – only 1 bit per element.

Compared with a hash table, Bitmap uses just 1 bit per element, offering excellent utilization for dense data such as sequential QQ numbers.

Fast operations – insertion and query are O(1) using bitwise operations.

Both setting and checking a bit are constant‑time operations, suitable for real‑time processing of massive data.

Simplified deduplication logic – just set bits and later read the positions.

No complex data structures are needed; a single pass suffices.

Disadvantages

Memory consumption depends on the value range.

If the range is huge but sparse (e.g., QQ numbers up to 1 trillion), the bitmap would require prohibitive memory (≈125 GB).

Cannot store additional information.

Bitmap only records presence, not counts or other metadata.

Conclusion

While a Bloom filter could also achieve sub‑1 GB deduplication for 4 billion QQ numbers, the bitmap approach offers deterministic results and straightforward implementation.

JavaBig DataMemory OptimizationBitMapdeduplication
macrozheng
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.