Big Data 13 min read

How to De‑duplicate 1 Billion QQ Numbers: Bitmap, Bloom Filter, and Distributed Strategies

This article explores multiple techniques for efficiently deduplicating ten‑hundred‑million QQ identifiers, covering single‑machine bitmap methods, Bloom filter trade‑offs, external sorting, Spark and Redis distributed solutions, and a layered bitmap index that balances memory usage, speed, and accuracy.

IT Services Circle
IT Services Circle
IT Services Circle
How to De‑duplicate 1 Billion QQ Numbers: Bitmap, Bloom Filter, and Distributed Strategies

Introduction

Recently a question appeared online: how to deduplicate 1 billion QQ numbers? This article shares several common solutions and practical tips.

Technical Challenges

1.1 Data Scale Analysis

Original data : 1 billion × 8 bytes = 8 GB

HashSet deduplication : requires at least 16 GB memory (Java object overhead)

Ideal solution : less than 1 GB memory

1.2 Core Challenges

Image
Image

Solution I – Single‑Machine Bitmap Method

2.1 Algorithm Principle

Use a bit array to represent the existence of each number.

public class BitMap {
    private final byte[] bits;
    public BitMap(int maxNum) {
        this.bits = new byte[(maxNum >> 3) + 1];
    }
    public void add(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        bits[arrayIndex] |= 1 << position;
    }
    public boolean contains(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        return (bits[arrayIndex] & (1 << position)) != 0;
    }
}

2.2 QQ Range Optimization

QQ range: 10000 (5 digits) – 9999999999 (10 digits).

Bitmap memory calculation : (10^10 – 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB

Optimization – store (qq – 10000) to reduce range.

public void add(long qq) {
    long num = qq - 10000;
    int arrayIndex = (int)(num >> 3);
    int position = (int)(num & 0x07);
    bits[arrayIndex] |= 1 << position;
}

Solution II – Bloom Filter

3.1 Memory Constraints

Bloom filter provides constant‑time checks with controllable false‑positive rate.

3.2 Parameter Design and Implementation

public class BloomFilter {
    private final BitSet bitset;
    private final int size;
    private final int[] seeds;
    public BloomFilter(int size, int hashCount) {
        this.bitset = new BitSet(size);
        this.size = size;
        this.seeds = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            seeds[i] = i * 31;
        }
    }
    public void add(String qq) {
        for (int seed : seeds) {
            int hash = hash(qq, seed);
            bitset.set(Math.abs(hash % size), true);
        }
    }
    public boolean contains(String qq) {
        for (int seed : seeds) {
            int hash = hash(qq, seed);
            if (!bitset.get(Math.abs(hash % size))) {
                return false;
            }
        }
        return true;
    }
    private int hash(String value, int seed) {
        int result = 0;
        for (char c : value.toCharArray()) {
            result = seed * result + c;
        }
        return result;
    }
}

3.3 Memory‑Optimization Effect

Original storage: 8 GB, Bitmap: 1.16 GB, Bloom filter (0.1 % error): 171 MB.

Solution III – Disk‑Based External Sort and Multi‑Way Merge

4.1 Process Flow

Image
Image

4.2 Key Code

public void externalSort(String input, String output) throws IOException {
    List<File> chunks = splitAndSort(input, 100_000_000);
    mergeFiles(chunks, output);
}
void mergeFiles(List<File> files, String output) throws IOException {
    PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
    List<BufferedReader> readers = new ArrayList<>();
    for (File file : files) {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        readers.add(reader);
        String line = reader.readLine();
        if (line != null) {
            queue.add(new MergeEntry(line, reader));
        }
    }
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
        long last = -1;
        while (!queue.isEmpty()) {
            MergeEntry entry = queue.poll();
            long qq = Long.parseLong(entry.value);
            if (qq != last) {
                writer.write(entry.value);
                writer.newLine();
                last = qq;
            }
            String next = entry.reader.readLine();
            if (next != null) {
                queue.add(new MergeEntry(next, entry.reader));
            }
        }
    } finally {
        readers.forEach(r -> {
            try { r.close(); } catch (IOException e) {}
        });
    }
}
class MergeEntry implements Comparable<MergeEntry> {
    String value;
    BufferedReader reader;
    MergeEntry(String value, BufferedReader reader) {
        this.value = value;
        this.reader = reader;
    }
    public int compareTo(MergeEntry o) {
        return this.value.compareTo(o.value);
    }
}

Solution IV – Distributed Approaches

5.1 Sharding Strategy

Image
Image

5.2 Spark Implementation

val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt")
    .map(_.toLong)
    .repartition(1000) // 1000 partitions

val distinctRDD = qqRDD.mapPartitions { iter =>
    val bitmap = new RoaringBitmap()
    iter.foreach(qq => bitmap.add(qq.toInt))
    bitmap.iterator.asScala.map(_.toLong)
}

val globalDistinct = distinctRDD.distinct()
globalDistinct.saveAsTextFile("hdfs://result/")

5.3 Memory Optimization – RoaringBitmap

Ordinary bitmap: ~1.16 GB; RoaringBitmap (sparse data): 100‑300 MB.

Solution V – Production‑Level Lambda Architecture

6.1 System Architecture Diagram

Image
Image

6.2 Technology Stack per Layer

Batch layer: Spark + HDFS (full‑batch deduplication) Speed layer: Flink + Redis (real‑time incremental deduplication) Service layer: Spring Boot + HBase (unified query API)

6.3 Real‑Time Deduplication Code

public class QQDeduplication {
    private static final String REDIS_KEY = "qq_set";
    public boolean isDuplicate(String qq) {
        try (Jedis jedis = jedisPool.getResource()) {
            // If estimated cardinality exceeds 1 billion, treat as duplicate
            if (jedis.pfcount(REDIS_KEY) > 1_000_000_000L) {
                return true;
            }
            return jedis.sadd(REDIS_KEY, qq) == 0;
        }
    }
}

Solution VI – Ultimate Layered Bitmap Index

7.1 Architecture Design

Image
Image

7.2 Storage and Computation

First layer: 100 intervals, each 100 million IDs.

Second layer: 100 sub‑intervals per first‑layer interval, each 1 million IDs.

Third layer: RoaringBitmap per sub‑interval (≈1.2 MB).

Total memory ≈ 12 GB (feasible with distributed storage).

7.3 Java Implementation

public class LayeredBitmap {
    private RoaringBitmap[][][] bitmaps;
    private static final int L1 = 100;
    private static final int L2 = 100;

    public LayeredBitmap() {
        bitmaps = new RoaringBitmap[L1][L2][];
    }

    public void add(long qq) {
        int l1Index = (int)((qq - 10000) / 100_000_000);
        long remainder = (qq - 10000) % 100_000_000;
        int l2Index = (int)(remainder / 1_000_000);
        int value = (int)(remainder % 1_000_000);
        if (bitmaps[l1Index][l2Index] == null) {
            bitmaps[l1Index][l2Index] = new RoaringBitmap();
        }
        bitmaps[l1Index][l2Index].add(value);
    }

    public boolean contains(long qq) {
        int l1Index = (int)((qq - 10000) / 100_000_000);
        long remainder = (qq - 10000) % 100_000_000;
        int l2Index = (int)(remainder / 1_000_000);
        int value = (int)(remainder % 1_000_000);
        RoaringBitmap bitmap = bitmaps[l1Index][l2Index];
        return bitmap != null && bitmap.contains(value);
    }
}

Comparison and Recommendation

Key metrics (memory, time complexity, accuracy) for each approach:

Solution            | Scenario                     | Memory/Storage | Time Complexity | Accuracy
-----------------------------------------------------------------------------------------------
Single‑machine bitmap| < 100 M records              | O(n) ~1 GB     | O(1)            | 100%
Bloom filter        | Hundreds of billions (tolerate error) | O(1) | O(k) | <99.9% (configurable)
External sort       | Single‑machine disk processing| Disk space     | O(n log n)      | 100%
Spark distributed   | Massive batch processing     | Cluster storage| O(n)            | 100%
Redis real‑time     | Incremental stream deduplication| O(n)        | O(1)            | 100%
Layered bitmap index| Ultra‑large precise deduplication| O(n) compressed| O(1) | 100%

Practical Experience and Pitfalls

9.1 Data Skew Solution

int getShardId(long qq, int shardCount) {
    String str = String.valueOf(qq);
    int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6)));
    return suffix % shardCount;
}

9.2 Deduplication Accuracy Assurance

Image
Image

9.3 Cost‑Optimization Tips

Hot‑cold separation : keep hot data in memory bitmap, cold data on disk.

Compressed storage : use RoaringBitmap instead of plain bitmap.

Tiered storage : recent three months in memory, older data in HBase with compression.

Conclusion

Divide‑and‑conquer: split 1 billion IDs into manageable chunks.

Space‑for‑time trade‑off: bitmap gives O(1) checks at the cost of memory.

Probabilistic wisdom: Bloom filter reduces memory with acceptable false‑positive rate.

Layered design: billions → millions → thousands for scalable handling.

Batch‑plus‑real‑time separation: Spark for historical data, Redis/Flink for incremental streams.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

BitmapdeduplicationDistributed
IT Services Circle
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.