How to De‑duplicate 1 Billion QQ Numbers with Bitmap, Bloom Filter, and Distributed Solutions
This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, Redis, and a hierarchical bitmap architecture—to efficiently deduplicate ten‑hundred‑million QQ numbers while balancing memory usage, accuracy, and processing speed.
Introduction
Recently a question appeared online: how to deduplicate 1 billion QQ numbers? This article shares several practical solutions ranging from single‑machine bitmap methods to large‑scale distributed architectures.
Technical Challenges
Data Scale Analysis
Raw data : 1e9 × 8 bytes = 8 GB
HashSet deduplication : at least 16 GB memory (Java object overhead)
Ideal solution : less than 1 GB memory
Core Challenges
Single‑Machine Solution: Bitmap
Algorithm Principle
Use a bit array to record the existence of each number.
public class BitMap { private final byte[] bits; public BitMap(int maxNum) { this.bits = new byte[(maxNum >> 3) + 1]; } public void add(int num) { int arrayIndex = num >> 3; int position = num & 0x07; bits[arrayIndex] |= 1 << position; } public boolean contains(int num) { int arrayIndex = num >> 3; int position = num & 0x07; return (bits[arrayIndex] & (1 << position)) != 0; } }QQ Number Range Optimization
QQ range: 10000 (5‑digit) – 9999999999 (10‑digit).
Bitmap memory calculation: (10^10 − 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB.
Optimization: store (qq − 10000) as offset.
public void add(long qq) { long num = qq - 10000; int arrayIndex = (int)(num >> 3); int position = (int)(num & 0x07); bits[arrayIndex] |= 1 << position; }Advanced Solution: Bloom Filter
Handling Memory Limits
A Bloom filter provides probabilistic deduplication with controllable false‑positive rates, dramatically reducing memory consumption.
public class BloomFilter { private final BitSet bitset; private final int size; private final int[] seeds; public BloomFilter(int size, int hashCount) { this.bitset = new BitSet(size); this.size = size; this.seeds = new int[hashCount]; for (int i = 0; i < hashCount; i++) { seeds[i] = i * 31; } } public void add(String qq) { for (int seed : seeds) { int hash = hash(qq, seed); bitset.set(Math.abs(hash % size), true); } } public boolean contains(String qq) { for (int seed : seeds) { int hash = hash(qq, seed); if (!bitset.get(Math.abs(hash % size))) return false; } return true; } private int hash(String value, int seed) { int result = 0; for (char c : value.toCharArray()) { result = seed * result + c; } return result; } }Memory Impact
Original storage 8 GB, bitmap 1.16 GB, Bloom filter (0.1 % false‑positive) ≈ 171 MB.
Disk‑Based Solution: External Sort & Multi‑Way Merge
Processing Flow
Key Code
public void externalSort(String input, String output) throws IOException { List<File> chunks = splitAndSort(input, 100_000_000); // each file 10 million records mergeFiles(chunks, output); } void mergeFiles(List<File> files, String output) { PriorityQueue<MergeEntry> queue = new PriorityQueue<>(); List<BufferedReader> readers = new ArrayList<>(); for (File file : files) { BufferedReader reader = new BufferedReader(new FileReader(file)); readers.add(reader); String line = reader.readLine(); if (line != null) { queue.add(new MergeEntry(line, reader)); } } try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) { long last = -1; while (!queue.isEmpty()) { MergeEntry entry = queue.poll(); long qq = Long.parseLong(entry.value); if (qq != last) { writer.write(entry.value); writer.newLine(); last = qq; } String next = entry.reader.readLine(); if (next != null) { queue.add(new MergeEntry(next, entry.reader)); } } } finally { readers.forEach(r -> { try { r.close(); } catch (IOException e) {} }); } } class MergeEntry implements Comparable<MergeEntry> { String value; BufferedReader reader; public MergeEntry(String value, BufferedReader reader) { this.value = value; this.reader = reader; } @Override public int compareTo(MergeEntry o) { return this.value.compareTo(o.value); } }Distributed Solution
Shard Strategy Design
Spark Implementation
val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt") .map(_.toLong) .repartition(1000) // 1000 partitions val distinctRDD = qqRDD.mapPartitions { iter => val bitmap = new RoaringBitmap() iter.foreach(qq => bitmap.add(qq.toInt)) bitmap.iterator.asScala.map(_.toLong) } val globalDistinct = distinctRDD.distinct() globalDistinct.saveAsTextFile("hdfs://result/")Memory Optimization: RoaringBitmap
Ordinary bitmap 1.16 GB, RoaringBitmap (sparse data) 100‑300 MB.
Production‑Grade Architecture: Lambda Architecture
System Diagram
Layer Technology Choices
Batch layer: Spark + HDFS (full deduplication). Speed layer: Flink + Redis (real‑time incremental deduplication). Service layer: Spring Boot + HBase (unified query API).
Ultimate Solution: Hierarchical Bitmap Index
Design
Three‑level partition: 100 top‑level intervals (each 100 million), each split into 100 sub‑intervals (each 1 million), each stored in a RoaringBitmap (~1.2 MB).
Memory Requirement
100 × 100 × 1.2 MB ≈ 12 GB, feasible with distributed storage.
Java Implementation
public class LayeredBitmap { private final RoaringBitmap[][][] bitmaps; private static final int L1 = 100; private static final int L2 = 100; public LayeredBitmap() { bitmaps = new RoaringBitmap[L1][L2][]; } public void add(long qq) { int l1Index = (int)((qq - 10000) / 100_000_000); long remainder = (qq - 10000) % 100_000_000; int l2Index = (int)(remainder / 1_000_000); int value = (int)(remainder % 1_000_000); if (bitmaps[l1Index][l2Index] == null) { bitmaps[l1Index][l2Index] = new RoaringBitmap(); } bitmaps[l1Index][l2Index].add(value); } public boolean contains(long qq) { int l1Index = (int)((qq - 10000) / 100_000_000); long remainder = (qq - 10000) % 100_000_000; int l2Index = (int)(remainder / 1_000_000); int value = (int)(remainder % 1_000_000); RoaringBitmap bitmap = bitmaps[l1Index][l2Index]; return bitmap != null && bitmap.contains(value); } }Comparison and Recommendation
• Single‑machine bitmap: O(1) time, 100 % accuracy, ~1.16 GB memory. • Bloom filter: O(1) time, <0.1 % false‑positive, ~171 MB memory. • External sort: O(n log n) time, 100 % accuracy, requires disk space. • Spark distributed: O(n) time, 100 % accuracy, uses cluster storage. • Redis real‑time: O(1) time, 100 % accuracy, in‑memory. • Hierarchical bitmap: O(1) time, 100 % accuracy, compressed ~12 GB, suitable for ultra‑large scale.
Practical Experience and Pitfalls
Data Skew Solution
int getShardId(long qq, int shardCount) { String str = String.valueOf(qq); int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6))); return suffix % shardCount; }Deduplication Accuracy Guarantee
Cost‑Optimization Tips
Hot‑cold separation: keep hot data in memory bitmap, cold data on disk.
Compressed storage: replace plain bitmap with RoaringBitmap.
Tiered storage: recent three months in memory, older data in HBase with compression.
Conclusion
Deduplicating 1 billion QQ numbers boils down to breaking the problem into manageable units, using space‑for‑time trade‑offs, probabilistic structures, hierarchical design, and separating batch from real‑time processing to achieve high efficiency and accuracy.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Tech Enthusiast
Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
