How to De‑duplicate 1 Billion QQ Numbers: Bitmap, Bloom Filter, and Distributed Strategies
This article explores multiple techniques for efficiently deduplicating ten‑hundred‑million QQ identifiers, covering single‑machine bitmap methods, Bloom filter trade‑offs, external sorting, Spark and Redis distributed solutions, and a layered bitmap index that balances memory usage, speed, and accuracy.
Introduction
Recently a question appeared online: how to deduplicate 1 billion QQ numbers? This article shares several common solutions and practical tips.
Technical Challenges
1.1 Data Scale Analysis
Original data : 1 billion × 8 bytes = 8 GB
HashSet deduplication : requires at least 16 GB memory (Java object overhead)
Ideal solution : less than 1 GB memory
1.2 Core Challenges
Solution I – Single‑Machine Bitmap Method
2.1 Algorithm Principle
Use a bit array to represent the existence of each number.
public class BitMap {
private final byte[] bits;
public BitMap(int maxNum) {
this.bits = new byte[(maxNum >> 3) + 1];
}
public void add(int num) {
int arrayIndex = num >> 3;
int position = num & 0x07;
bits[arrayIndex] |= 1 << position;
}
public boolean contains(int num) {
int arrayIndex = num >> 3;
int position = num & 0x07;
return (bits[arrayIndex] & (1 << position)) != 0;
}
}2.2 QQ Range Optimization
QQ range: 10000 (5 digits) – 9999999999 (10 digits).
Bitmap memory calculation : (10^10 – 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB
Optimization – store (qq – 10000) to reduce range.
public void add(long qq) {
long num = qq - 10000;
int arrayIndex = (int)(num >> 3);
int position = (int)(num & 0x07);
bits[arrayIndex] |= 1 << position;
}Solution II – Bloom Filter
3.1 Memory Constraints
Bloom filter provides constant‑time checks with controllable false‑positive rate.
3.2 Parameter Design and Implementation
public class BloomFilter {
private final BitSet bitset;
private final int size;
private final int[] seeds;
public BloomFilter(int size, int hashCount) {
this.bitset = new BitSet(size);
this.size = size;
this.seeds = new int[hashCount];
for (int i = 0; i < hashCount; i++) {
seeds[i] = i * 31;
}
}
public void add(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
bitset.set(Math.abs(hash % size), true);
}
}
public boolean contains(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
if (!bitset.get(Math.abs(hash % size))) {
return false;
}
}
return true;
}
private int hash(String value, int seed) {
int result = 0;
for (char c : value.toCharArray()) {
result = seed * result + c;
}
return result;
}
}3.3 Memory‑Optimization Effect
Original storage: 8 GB, Bitmap: 1.16 GB, Bloom filter (0.1 % error): 171 MB.
Solution III – Disk‑Based External Sort and Multi‑Way Merge
4.1 Process Flow
4.2 Key Code
public void externalSort(String input, String output) throws IOException {
List<File> chunks = splitAndSort(input, 100_000_000);
mergeFiles(chunks, output);
}
void mergeFiles(List<File> files, String output) throws IOException {
PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
List<BufferedReader> readers = new ArrayList<>();
for (File file : files) {
BufferedReader reader = new BufferedReader(new FileReader(file));
readers.add(reader);
String line = reader.readLine();
if (line != null) {
queue.add(new MergeEntry(line, reader));
}
}
try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
long last = -1;
while (!queue.isEmpty()) {
MergeEntry entry = queue.poll();
long qq = Long.parseLong(entry.value);
if (qq != last) {
writer.write(entry.value);
writer.newLine();
last = qq;
}
String next = entry.reader.readLine();
if (next != null) {
queue.add(new MergeEntry(next, entry.reader));
}
}
} finally {
readers.forEach(r -> {
try { r.close(); } catch (IOException e) {}
});
}
}
class MergeEntry implements Comparable<MergeEntry> {
String value;
BufferedReader reader;
MergeEntry(String value, BufferedReader reader) {
this.value = value;
this.reader = reader;
}
public int compareTo(MergeEntry o) {
return this.value.compareTo(o.value);
}
}Solution IV – Distributed Approaches
5.1 Sharding Strategy
5.2 Spark Implementation
val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt")
.map(_.toLong)
.repartition(1000) // 1000 partitions
val distinctRDD = qqRDD.mapPartitions { iter =>
val bitmap = new RoaringBitmap()
iter.foreach(qq => bitmap.add(qq.toInt))
bitmap.iterator.asScala.map(_.toLong)
}
val globalDistinct = distinctRDD.distinct()
globalDistinct.saveAsTextFile("hdfs://result/")5.3 Memory Optimization – RoaringBitmap
Ordinary bitmap: ~1.16 GB; RoaringBitmap (sparse data): 100‑300 MB.
Solution V – Production‑Level Lambda Architecture
6.1 System Architecture Diagram
6.2 Technology Stack per Layer
Batch layer: Spark + HDFS (full‑batch deduplication) Speed layer: Flink + Redis (real‑time incremental deduplication) Service layer: Spring Boot + HBase (unified query API)
6.3 Real‑Time Deduplication Code
public class QQDeduplication {
private static final String REDIS_KEY = "qq_set";
public boolean isDuplicate(String qq) {
try (Jedis jedis = jedisPool.getResource()) {
// If estimated cardinality exceeds 1 billion, treat as duplicate
if (jedis.pfcount(REDIS_KEY) > 1_000_000_000L) {
return true;
}
return jedis.sadd(REDIS_KEY, qq) == 0;
}
}
}Solution VI – Ultimate Layered Bitmap Index
7.1 Architecture Design
7.2 Storage and Computation
First layer: 100 intervals, each 100 million IDs.
Second layer: 100 sub‑intervals per first‑layer interval, each 1 million IDs.
Third layer: RoaringBitmap per sub‑interval (≈1.2 MB).
Total memory ≈ 12 GB (feasible with distributed storage).
7.3 Java Implementation
public class LayeredBitmap {
private RoaringBitmap[][][] bitmaps;
private static final int L1 = 100;
private static final int L2 = 100;
public LayeredBitmap() {
bitmaps = new RoaringBitmap[L1][L2][];
}
public void add(long qq) {
int l1Index = (int)((qq - 10000) / 100_000_000);
long remainder = (qq - 10000) % 100_000_000;
int l2Index = (int)(remainder / 1_000_000);
int value = (int)(remainder % 1_000_000);
if (bitmaps[l1Index][l2Index] == null) {
bitmaps[l1Index][l2Index] = new RoaringBitmap();
}
bitmaps[l1Index][l2Index].add(value);
}
public boolean contains(long qq) {
int l1Index = (int)((qq - 10000) / 100_000_000);
long remainder = (qq - 10000) % 100_000_000;
int l2Index = (int)(remainder / 1_000_000);
int value = (int)(remainder % 1_000_000);
RoaringBitmap bitmap = bitmaps[l1Index][l2Index];
return bitmap != null && bitmap.contains(value);
}
}Comparison and Recommendation
Key metrics (memory, time complexity, accuracy) for each approach:
Solution | Scenario | Memory/Storage | Time Complexity | Accuracy
-----------------------------------------------------------------------------------------------
Single‑machine bitmap| < 100 M records | O(n) ~1 GB | O(1) | 100%
Bloom filter | Hundreds of billions (tolerate error) | O(1) | O(k) | <99.9% (configurable)
External sort | Single‑machine disk processing| Disk space | O(n log n) | 100%
Spark distributed | Massive batch processing | Cluster storage| O(n) | 100%
Redis real‑time | Incremental stream deduplication| O(n) | O(1) | 100%
Layered bitmap index| Ultra‑large precise deduplication| O(n) compressed| O(1) | 100%Practical Experience and Pitfalls
9.1 Data Skew Solution
int getShardId(long qq, int shardCount) {
String str = String.valueOf(qq);
int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6)));
return suffix % shardCount;
}9.2 Deduplication Accuracy Assurance
9.3 Cost‑Optimization Tips
Hot‑cold separation : keep hot data in memory bitmap, cold data on disk.
Compressed storage : use RoaringBitmap instead of plain bitmap.
Tiered storage : recent three months in memory, older data in HBase with compression.
Conclusion
Divide‑and‑conquer: split 1 billion IDs into manageable chunks.
Space‑for‑time trade‑off: bitmap gives O(1) checks at the cost of memory.
Probabilistic wisdom: Bloom filter reduces memory with acceptable false‑positive rate.
Layered design: billions → millions → thousands for scalable handling.
Batch‑plus‑real‑time separation: Spark for historical data, Redis/Flink for incremental streams.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
