How to Deduplicate 4 Billion QQ IDs Using a Bitmap Within 1 GB Memory
Learn how to efficiently remove duplicates from 4 billion QQ numbers using a memory‑friendly Bitmap approach that fits within a 1 GB limit, including calculations, step‑by‑step implementation, Java code, and a discussion of its advantages and drawbacks.
Introduction
This article tackles a classic massive‑data deduplication problem: removing duplicates from 4 billion QQ numbers while staying under a strict 1 GB memory budget. It explores why a naïve HashSet solution fails and proposes a Bitmap‑based method.
Conventional Approach and Memory Issue
Developers often reach for HashSet to achieve deduplication:
Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatically dedupesHowever, storing 4 billion 64‑bit integers would require roughly 30 GB of RAM (or 15 GB if using 32‑bit integers), far exceeding the 1 GB limit.
Bitmap Data Structure
What is a Bitmap?
A Bitmap (位图) is a highly efficient data structure that uses a single bit to indicate the presence of a value. For a range of numbers, each possible value maps to one bit.
Example for a 10‑bit Bitmap representing numbers 0‑9:
If the 0th bit is 1, number 0 exists.
If the 1st bit is 1, number 1 exists.
If the 2nd bit is 1, number 2 exists.
Illustration:
For the QQ numbers 9, 5, 2, 7 the Bitmap would look like:
Memory Estimation for 4 Billion IDs
A Bitmap needs one bit per possible ID. For 4 billion IDs:
4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GBThus the Bitmap fits comfortably within the 1 GB constraint.
Deduplication Process Using Bitmap
Initialize a Bitmap with 4 billion bits (≈0.5 GB).
Iterate over each QQ number, map it to the corresponding bit position, and set that bit to 1.
After processing, scan the Bitmap and collect all indices whose bits are 1; these indices are the unique QQ numbers.
Example: For QQ number 326443281, set the bit at position 326443281 to 1.
Java Implementation
import java.util.*;
public class QQDeduplication {
// Bitmap size = 2^32 bits ≈ 0.5 GB
private static final long BITMAP_SIZE = 1L << 32; // 2^32
private static final int BYTE_SIZE = 8;
private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
for (long qqNumber : qqNumbers) {
if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
int index = (int) (qqNumber / BYTE_SIZE);
int bitPos = (int) (qqNumber % BYTE_SIZE);
bitmap[index] |= (1 << bitPos);
}
}
List<Long> unique = new ArrayList<>();
for (int i = 0; i < bitmap.length; i++) {
for (int j = 0; j < BYTE_SIZE; j++) {
if ((bitmap[i] & (1 << j)) != 0) {
long qq = (long) i * BYTE_SIZE + j;
unique.add(qq);
}
}
}
return unique;
}
}Advantages and Disadvantages of Bitmap
Advantages
High space efficiency: only 1 bit per element.
O(1) insertion and query using bit operations.
Simplified deduplication logic—just set and check bits.
Disadvantages
Memory usage depends on the value range; sparse large ranges waste space (e.g., a trillion‑scale ID space would need >125 GB).
Cannot store additional metadata such as occurrence counts—only presence/absence.
Conclusion
Using a Bitmap allows deduplication of 4 billion QQ numbers within a sub‑gigabyte memory footprint, offering fast O(1) operations and dramatic storage savings compared to hash‑based structures. For scenarios with extremely sparse ID spaces, alternative techniques like Bloom filters may be considered.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
