Big Data 9 min read

How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB

This article explains how to efficiently remove duplicates from 4 billion QQ numbers within a 1 GB memory limit by replacing the naïve HashSet approach with a memory‑saving Bitmap data structure, complete with calculations, algorithm steps, Java code, and a discussion of its pros and cons.

macrozheng
macrozheng
macrozheng
How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB

1. Conventional Approach

In everyday development, the first thought for deduplication is to use a HashSet, which automatically removes duplicates:

Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatic deduplication

However, with a 1 GB memory limit, storing 4 billion QQ numbers in a HashSet is infeasible. Assuming each QQ number occupies 8 bytes (64‑bit integer), the total memory required is about 30 GB, far exceeding the limit.

2. Bitmap Solution

2.1 What is a Bitmap?

A Bitmap is a highly efficient data structure that uses a single bit to indicate the presence of a number, making it ideal for large‑scale deduplication and query problems.

For example, a Bitmap of length 10 can represent numbers 0‑9, where each bit set to 1 means the corresponding number exists.

If the 0th bit is 1, number 0 exists.

If the 1st bit is 1, number 1 exists.

If the 2nd bit is 1, number 2 exists.

…and so on.

Images illustrate the bitmap representation:

When recording QQ numbers 9, 5, 2, 7, the bitmap looks like this:

Using a traditional 4‑byte integer for each number would require 16 bytes for four numbers, whereas the bitmap needs only 10 bits, demonstrating significant space savings.

2.2 Applying Bitmap to 4 Billion QQ Numbers

2.2.1 Memory Estimate

Creating a bitmap with 4 billion bits requires:

4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GB

This fits comfortably within the 1 GB limit.

2.2.2 Deduplication Process

Initialize a bitmap with 4 billion bits.

Iterate over each QQ number, map it to a bit position, and set that bit to 1.

For example, QQ number 326443281 sets the corresponding bit to 1.

Traverse the bitmap and collect all indices whose bits are 1; these indices correspond to the unique QQ numbers.

2.3 Simple Java Implementation

The following code demonstrates bitmap‑based deduplication:

import java.util.*;

public class QQDeduplication {
    // Bitmap size: 2^32 bits ≈ 0.5 GB
    private static final long BITMAP_SIZE = 1L << 32;
    private static final int BYTE_SIZE = 8; // bits per byte

    private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
        byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
        for (long qqNumber : qqNumbers) {
            if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
                int index = (int) (qqNumber / BYTE_SIZE);
                int bitPosition = (int) (qqNumber % BYTE_SIZE);
                bitmap[index] |= (1 << bitPosition);
            }
        }
        List<Long> uniqueQQNumbers = new ArrayList<>();
        for (int i = 0; i < bitmap.length; i++) {
            for (int j = 0; j < BYTE_SIZE; j++) {
                if ((bitmap[i] & (1 << j)) != 0) {
                    long qqNumber = (long) i * BYTE_SIZE + j;
                    uniqueQQNumbers.add(qqNumber);
                }
            }
        }
        return uniqueQQNumbers;
    }
}

2.4 Advantages and Disadvantages of Bitmap

Advantages

High space efficiency – only 1 bit per element.

Compared with a hash table, Bitmap uses just 1 bit per element, offering excellent utilization for dense data such as sequential QQ numbers.

Fast operations – insertion and query are O(1) using bitwise operations.

Both setting and checking a bit are constant‑time operations, suitable for real‑time processing of massive data.

Simplified deduplication logic – just set bits and later read the positions.

No complex data structures are needed; a single pass suffices.

Disadvantages

Memory consumption depends on the value range.

If the range is huge but sparse (e.g., QQ numbers up to 1 trillion), the bitmap would require prohibitive memory (≈125 GB).

Cannot store additional information.

Bitmap only records presence, not counts or other metadata.

Conclusion

While a Bloom filter could also achieve sub‑1 GB deduplication for 4 billion QQ numbers, the bitmap approach offers deterministic results and straightforward implementation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaMemory OptimizationBitmapdeduplication
macrozheng
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.