How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB
This article explains how to efficiently remove duplicates from 4 billion QQ numbers within a 1 GB memory limit by replacing the naïve HashSet approach with a memory‑saving Bitmap data structure, complete with calculations, algorithm steps, Java code, and a discussion of its pros and cons.
1. Conventional Approach
In everyday development, the first thought for deduplication is to use a HashSet, which automatically removes duplicates:
Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatic deduplicationHowever, with a 1 GB memory limit, storing 4 billion QQ numbers in a HashSet is infeasible. Assuming each QQ number occupies 8 bytes (64‑bit integer), the total memory required is about 30 GB, far exceeding the limit.
2. Bitmap Solution
2.1 What is a Bitmap?
A Bitmap is a highly efficient data structure that uses a single bit to indicate the presence of a number, making it ideal for large‑scale deduplication and query problems.
For example, a Bitmap of length 10 can represent numbers 0‑9, where each bit set to 1 means the corresponding number exists.
If the 0th bit is 1, number 0 exists.
If the 1st bit is 1, number 1 exists.
If the 2nd bit is 1, number 2 exists.
…and so on.
Images illustrate the bitmap representation:
When recording QQ numbers 9, 5, 2, 7, the bitmap looks like this:
Using a traditional 4‑byte integer for each number would require 16 bytes for four numbers, whereas the bitmap needs only 10 bits, demonstrating significant space savings.
2.2 Applying Bitmap to 4 Billion QQ Numbers
2.2.1 Memory Estimate
Creating a bitmap with 4 billion bits requires:
4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GBThis fits comfortably within the 1 GB limit.
2.2.2 Deduplication Process
Initialize a bitmap with 4 billion bits.
Iterate over each QQ number, map it to a bit position, and set that bit to 1.
For example, QQ number 326443281 sets the corresponding bit to 1.
Traverse the bitmap and collect all indices whose bits are 1; these indices correspond to the unique QQ numbers.
2.3 Simple Java Implementation
The following code demonstrates bitmap‑based deduplication:
import java.util.*;
public class QQDeduplication {
// Bitmap size: 2^32 bits ≈ 0.5 GB
private static final long BITMAP_SIZE = 1L << 32;
private static final int BYTE_SIZE = 8; // bits per byte
private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
for (long qqNumber : qqNumbers) {
if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
int index = (int) (qqNumber / BYTE_SIZE);
int bitPosition = (int) (qqNumber % BYTE_SIZE);
bitmap[index] |= (1 << bitPosition);
}
}
List<Long> uniqueQQNumbers = new ArrayList<>();
for (int i = 0; i < bitmap.length; i++) {
for (int j = 0; j < BYTE_SIZE; j++) {
if ((bitmap[i] & (1 << j)) != 0) {
long qqNumber = (long) i * BYTE_SIZE + j;
uniqueQQNumbers.add(qqNumber);
}
}
}
return uniqueQQNumbers;
}
}2.4 Advantages and Disadvantages of Bitmap
Advantages
High space efficiency – only 1 bit per element.
Compared with a hash table, Bitmap uses just 1 bit per element, offering excellent utilization for dense data such as sequential QQ numbers.
Fast operations – insertion and query are O(1) using bitwise operations.
Both setting and checking a bit are constant‑time operations, suitable for real‑time processing of massive data.
Simplified deduplication logic – just set bits and later read the positions.
No complex data structures are needed; a single pass suffices.
Disadvantages
Memory consumption depends on the value range.
If the range is huge but sparse (e.g., QQ numbers up to 1 trillion), the bitmap would require prohibitive memory (≈125 GB).
Cannot store additional information.
Bitmap only records presence, not counts or other metadata.
Conclusion
While a Bloom filter could also achieve sub‑1 GB deduplication for 4 billion QQ numbers, the bitmap approach offers deterministic results and straightforward implementation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
