How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory
This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, and layered bitmap structures—to efficiently remove duplicate QQ numbers from a dataset of up to one billion entries while keeping memory usage below a gigabyte and maintaining high accuracy.
Introduction
The problem of deduplicating 1 billion QQ numbers is introduced, highlighting the data size (10⁹ × 8 bytes ≈ 8 GB) and the need for memory‑efficient solutions.
Technical Challenges
Data Scale Analysis
Raw data : 10⁹ × 8 bytes = 8 GB
HashSet deduplication : requires at least 16 GB due to Java object overhead
Ideal solution : less than 1 GB memory
Core Challenges
1️⃣ Single‑Machine Bitmap Method
Algorithm Principle
Use a bit array to represent the existence of each number.
public class BitMap {
private final byte[] bits;
public BitMap(int maxNum) {
this.bits = new byte[(maxNum >> 3) + 1]; // each byte stores 8 numbers
}
public void add(int num) {
int arrayIndex = num >> 3; // num / 8
int position = num & 0x07; // num % 8
bits[arrayIndex] |= 1 << position;
}
public boolean contains(int num) {
int arrayIndex = num >> 3;
int position = num & 0x07;
return (bits[arrayIndex] & (1 << position)) != 0;
}
}QQ Number Range Optimization
QQ numbers range from 10 000 (5 digits) to 9 999 999 999 (10 digits). The bitmap memory requirement is: (10^10 - 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB Optimized offset storage (qq - 10 000) reduces the bitmap size.
public void add(long qq) {
long num = qq - 10000L;
int arrayIndex = (int)(num >> 3);
int position = (int)(num & 0x07);
bits[arrayIndex] |= 1 << position;
}2️⃣ Advanced Scheme: Bloom Filter
Handling Memory Limits
Bloom filters provide probabilistic deduplication with controllable false‑positive rates.
public class BloomFilter {
private final BitSet bitset;
private final int size;
private final int[] seeds;
public BloomFilter(int size, int hashCount) {
this.bitset = new BitSet(size);
this.size = size;
this.seeds = new int[hashCount];
for (int i = 0; i < hashCount; i++) {
seeds[i] = i * 31;
}
}
public void add(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
bitset.set(Math.abs(hash % size), true);
}
}
public boolean contains(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
if (!bitset.get(Math.abs(hash % size))) {
return false;
}
}
return true;
}
private int hash(String value, int seed) {
int result = 0;
for (char c : value.toCharArray()) {
result = seed * result + c;
}
return result;
}
}3️⃣ Disk‑Based Solution: External Sort & Multi‑Way Merge
Processing Flow
Key Implementation
public void externalSort(String input, String output) throws IOException {
List<File> chunks = splitAndSort(input, 100_000_000); // each chunk ~10 M lines
mergeFiles(chunks, output);
}
void mergeFiles(List<File> files, String output) {
PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
List<BufferedReader> readers = new ArrayList<>();
for (File file : files) {
BufferedReader reader = new BufferedReader(new FileReader(file));
readers.add(reader);
String line = reader.readLine();
if (line != null) {
queue.add(new MergeEntry(line, reader));
}
}
try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
long last = -1;
while (!queue.isEmpty()) {
MergeEntry entry = queue.poll();
long qq = Long.parseLong(entry.value);
if (qq != last) {
writer.write(entry.value);
writer.newLine();
last = qq;
}
String next = entry.reader.readLine();
if (next != null) {
queue.add(new MergeEntry(next, entry.reader));
}
}
} finally {
readers.forEach(r -> { try { r.close(); } catch (IOException e) {} });
}
}
class MergeEntry implements Comparable<MergeEntry> {
String value;
BufferedReader reader;
MergeEntry(String value, BufferedReader reader) { this.value = value; this.reader = reader; }
@Override public int compareTo(MergeEntry o) { return this.value.compareTo(o.value); }
}4️⃣ Distributed Solution with Spark
val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt")
.map(_.toLong)
.repartition(1000) // 1000 partitions
val distinctRDD = qqRDD.mapPartitions { iter =>
val bitmap = new RoaringBitmap()
iter.foreach(qq => bitmap.add(qq.toInt))
bitmap.iterator.asScala.map(_.toLong)
}
val globalDistinct = distinctRDD.distinct()
globalDistinct.saveAsTextFile("hdfs://result/")5️⃣ Real‑Time Deduplication with Redis
public class QQDeduplication {
private static final String REDIS_KEY = "qq_set";
public boolean isDuplicate(String qq) {
try (Jedis jedis = jedisPool.getResource()) {
// If cardinality exceeds 1 billion, treat as duplicate directly
if (jedis.pfcount(REDIS_KEY) > 1_000_000_000L) {
return true;
}
return jedis.sadd(REDIS_KEY, qq) == 0; // 0 means already existed
}
}
}6️⃣ Layered Bitmap Index for Ultra‑Large Scale
public class LayeredBitmap {
private final RoaringBitmap[][][] bitmaps;
private static final int L1 = 100; // first‑level shards
private static final int L2 = 100; // second‑level shards
public LayeredBitmap() {
bitmaps = new RoaringBitmap[L1][L2][];
}
public void add(long qq) {
int l1Index = (int)((qq - 10000L) / 100_000_000);
long remainder = (qq - 10000L) % 100_000_000;
int l2Index = (int)(remainder / 1_000_000);
int value = (int)(remainder % 1_000_000);
if (bitmaps[l1Index][l2Index] == null) {
bitmaps[l1Index][l2Index] = new RoaringBitmap();
}
bitmaps[l1Index][l2Index].add(value);
}
public boolean contains(long qq) {
int l1Index = (int)((qq - 10000L) / 100_000_000);
long remainder = (qq - 10000L) % 100_000_000;
int l2Index = (int)(remainder / 1_000_000);
int value = (int)(remainder % 1_000_000);
RoaringBitmap bitmap = bitmaps[l1Index][l2Index];
return bitmap != null && bitmap.contains(value);
}
}7️⃣ Comparison of Schemes
Single‑machine bitmap: Memory ≈ 1.16 GB, Accuracy 100 %, O(1) operations.
Bloom filter (0.1 % false‑positive): Memory ≈ 171 MB, O(k) operations, Accuracy ≈ 99.9 %.
External sort: disk‑based, O(n log n) time, 100 % accuracy.
Spark distributed: cluster storage, O(n) time, 100 % accuracy.
Redis real‑time: O(1) time, 100 % accuracy for incremental data.
Layered bitmap index: O(n) compressed storage, O(1) time, 100 % accuracy for ultra‑large scale.
8️⃣ Practical Tips & Pitfalls
Data Skew Mitigation
int getShardId(long qq, int shardCount) {
String str = String.valueOf(qq);
int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6)));
return suffix % shardCount;
}Accuracy Assurance
Cost Optimization
Cold‑hot separation: hot data in memory bitmap, cold data on disk.
Compressed storage: replace plain bitmap with RoaringBitmap.
Tiered storage: recent three months in memory, older data in HBase with compression.
Conclusion
The deduplication of 1 billion QQ numbers can be broken down into manageable sub‑problems using divide‑and‑conquer, bitmap indexing for O(1) lookups, Bloom filters for space‑efficient probabilistic checks, external sorting for disk‑based processing, and layered designs for ultra‑large scale precision.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
