Big Data 12 min read

How to De‑duplicate 1 Billion QQ Numbers with Bitmap, Bloom Filter, and Distributed Solutions

This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, Redis, and a hierarchical bitmap architecture—to efficiently deduplicate ten‑hundred‑million QQ numbers while balancing memory usage, accuracy, and processing speed.

Java Tech Enthusiast
Java Tech Enthusiast
Java Tech Enthusiast
How to De‑duplicate 1 Billion QQ Numbers with Bitmap, Bloom Filter, and Distributed Solutions

Introduction

Recently a question appeared online: how to deduplicate 1 billion QQ numbers? This article shares several practical solutions ranging from single‑machine bitmap methods to large‑scale distributed architectures.

Technical Challenges

Data Scale Analysis

Raw data : 1e9 × 8 bytes = 8 GB

HashSet deduplication : at least 16 GB memory (Java object overhead)

Ideal solution : less than 1 GB memory

Core Challenges

Single‑Machine Solution: Bitmap

Algorithm Principle

Use a bit array to record the existence of each number.

public class BitMap { private final byte[] bits; public BitMap(int maxNum) { this.bits = new byte[(maxNum >> 3) + 1]; } public void add(int num) { int arrayIndex = num >> 3; int position = num & 0x07; bits[arrayIndex] |= 1 << position; } public boolean contains(int num) { int arrayIndex = num >> 3; int position = num & 0x07; return (bits[arrayIndex] & (1 << position)) != 0; } }

QQ Number Range Optimization

QQ range: 10000 (5‑digit) – 9999999999 (10‑digit).

Bitmap memory calculation: (10^10 − 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB.

Optimization: store (qq − 10000) as offset.

public void add(long qq) { long num = qq - 10000; int arrayIndex = (int)(num >> 3); int position = (int)(num & 0x07); bits[arrayIndex] |= 1 << position; }

Advanced Solution: Bloom Filter

Handling Memory Limits

A Bloom filter provides probabilistic deduplication with controllable false‑positive rates, dramatically reducing memory consumption.

public class BloomFilter { private final BitSet bitset; private final int size; private final int[] seeds; public BloomFilter(int size, int hashCount) { this.bitset = new BitSet(size); this.size = size; this.seeds = new int[hashCount]; for (int i = 0; i < hashCount; i++) { seeds[i] = i * 31; } } public void add(String qq) { for (int seed : seeds) { int hash = hash(qq, seed); bitset.set(Math.abs(hash % size), true); } } public boolean contains(String qq) { for (int seed : seeds) { int hash = hash(qq, seed); if (!bitset.get(Math.abs(hash % size))) return false; } return true; } private int hash(String value, int seed) { int result = 0; for (char c : value.toCharArray()) { result = seed * result + c; } return result; } }

Memory Impact

Original storage 8 GB, bitmap 1.16 GB, Bloom filter (0.1 % false‑positive) ≈ 171 MB.

Disk‑Based Solution: External Sort & Multi‑Way Merge

Processing Flow

Key Code

public void externalSort(String input, String output) throws IOException { List<File> chunks = splitAndSort(input, 100_000_000); // each file 10 million records mergeFiles(chunks, output); }
void mergeFiles(List<File> files, String output) { PriorityQueue<MergeEntry> queue = new PriorityQueue<>(); List<BufferedReader> readers = new ArrayList<>(); for (File file : files) { BufferedReader reader = new BufferedReader(new FileReader(file)); readers.add(reader); String line = reader.readLine(); if (line != null) { queue.add(new MergeEntry(line, reader)); } } try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) { long last = -1; while (!queue.isEmpty()) { MergeEntry entry = queue.poll(); long qq = Long.parseLong(entry.value); if (qq != last) { writer.write(entry.value); writer.newLine(); last = qq; } String next = entry.reader.readLine(); if (next != null) { queue.add(new MergeEntry(next, entry.reader)); } } } finally { readers.forEach(r -> { try { r.close(); } catch (IOException e) {} }); } }
class MergeEntry implements Comparable<MergeEntry> { String value; BufferedReader reader; public MergeEntry(String value, BufferedReader reader) { this.value = value; this.reader = reader; } @Override public int compareTo(MergeEntry o) { return this.value.compareTo(o.value); } }

Distributed Solution

Shard Strategy Design

Spark Implementation

val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt") .map(_.toLong) .repartition(1000) // 1000 partitions val distinctRDD = qqRDD.mapPartitions { iter => val bitmap = new RoaringBitmap() iter.foreach(qq => bitmap.add(qq.toInt)) bitmap.iterator.asScala.map(_.toLong) } val globalDistinct = distinctRDD.distinct() globalDistinct.saveAsTextFile("hdfs://result/")

Memory Optimization: RoaringBitmap

Ordinary bitmap 1.16 GB, RoaringBitmap (sparse data) 100‑300 MB.

Production‑Grade Architecture: Lambda Architecture

System Diagram

Layer Technology Choices

Batch layer: Spark + HDFS (full deduplication). Speed layer: Flink + Redis (real‑time incremental deduplication). Service layer: Spring Boot + HBase (unified query API).

Ultimate Solution: Hierarchical Bitmap Index

Design

Three‑level partition: 100 top‑level intervals (each 100 million), each split into 100 sub‑intervals (each 1 million), each stored in a RoaringBitmap (~1.2 MB).

Memory Requirement

100 × 100 × 1.2 MB ≈ 12 GB, feasible with distributed storage.

Java Implementation

public class LayeredBitmap { private final RoaringBitmap[][][] bitmaps; private static final int L1 = 100; private static final int L2 = 100; public LayeredBitmap() { bitmaps = new RoaringBitmap[L1][L2][]; } public void add(long qq) { int l1Index = (int)((qq - 10000) / 100_000_000); long remainder = (qq - 10000) % 100_000_000; int l2Index = (int)(remainder / 1_000_000); int value = (int)(remainder % 1_000_000); if (bitmaps[l1Index][l2Index] == null) { bitmaps[l1Index][l2Index] = new RoaringBitmap(); } bitmaps[l1Index][l2Index].add(value); } public boolean contains(long qq) { int l1Index = (int)((qq - 10000) / 100_000_000); long remainder = (qq - 10000) % 100_000_000; int l2Index = (int)(remainder / 1_000_000); int value = (int)(remainder % 1_000_000); RoaringBitmap bitmap = bitmaps[l1Index][l2Index]; return bitmap != null && bitmap.contains(value); } }

Comparison and Recommendation

• Single‑machine bitmap: O(1) time, 100 % accuracy, ~1.16 GB memory. • Bloom filter: O(1) time, <0.1 % false‑positive, ~171 MB memory. • External sort: O(n log n) time, 100 % accuracy, requires disk space. • Spark distributed: O(n) time, 100 % accuracy, uses cluster storage. • Redis real‑time: O(1) time, 100 % accuracy, in‑memory. • Hierarchical bitmap: O(1) time, 100 % accuracy, compressed ~12 GB, suitable for ultra‑large scale.

Practical Experience and Pitfalls

Data Skew Solution

int getShardId(long qq, int shardCount) { String str = String.valueOf(qq); int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6))); return suffix % shardCount; }

Deduplication Accuracy Guarantee

Cost‑Optimization Tips

Hot‑cold separation: keep hot data in memory bitmap, cold data on disk.

Compressed storage: replace plain bitmap with RoaringBitmap.

Tiered storage: recent three months in memory, older data in HBase with compression.

Conclusion

Deduplicating 1 billion QQ numbers boils down to breaking the problem into manageable units, using space‑for‑time trade‑offs, probabilistic structures, hierarchical design, and separating batch from real‑time processing to achieve high efficiency and accuracy.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsBitmapdeduplicationbloom-filter
Java Tech Enthusiast
Written by

Java Tech Enthusiast

Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.