Big Data 12 min read

How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory

This article explores multiple techniques—including bitmap indexing, Bloom filters, external sorting, Spark, and layered bitmap structures—to efficiently remove duplicate QQ numbers from a dataset of up to one billion entries while keeping memory usage below a gigabyte and maintaining high accuracy.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
How to De‑Duplicate 1 Billion QQ Numbers Using Under 1 GB of Memory

Introduction

The problem of deduplicating 1 billion QQ numbers is introduced, highlighting the data size (10⁹ × 8 bytes ≈ 8 GB) and the need for memory‑efficient solutions.

Technical Challenges

Data Scale Analysis

Raw data : 10⁹ × 8 bytes = 8 GB

HashSet deduplication : requires at least 16 GB due to Java object overhead

Ideal solution : less than 1 GB memory

Core Challenges

1️⃣ Single‑Machine Bitmap Method

Algorithm Principle

Use a bit array to represent the existence of each number.

public class BitMap {
    private final byte[] bits;
    public BitMap(int maxNum) {
        this.bits = new byte[(maxNum >> 3) + 1]; // each byte stores 8 numbers
    }
    public void add(int num) {
        int arrayIndex = num >> 3; // num / 8
        int position = num & 0x07; // num % 8
        bits[arrayIndex] |= 1 << position;
    }
    public boolean contains(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        return (bits[arrayIndex] & (1 << position)) != 0;
    }
}

QQ Number Range Optimization

QQ numbers range from 10 000 (5 digits) to 9 999 999 999 (10 digits). The bitmap memory requirement is: (10^10 - 10^4) / 8 / 1024 / 1024 ≈ 1.16 GB Optimized offset storage (qq - 10 000) reduces the bitmap size.

public void add(long qq) {
    long num = qq - 10000L;
    int arrayIndex = (int)(num >> 3);
    int position = (int)(num & 0x07);
    bits[arrayIndex] |= 1 << position;
}

2️⃣ Advanced Scheme: Bloom Filter

Handling Memory Limits

Bloom filters provide probabilistic deduplication with controllable false‑positive rates.

public class BloomFilter {
    private final BitSet bitset;
    private final int size;
    private final int[] seeds;
    public BloomFilter(int size, int hashCount) {
        this.bitset = new BitSet(size);
        this.size = size;
        this.seeds = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            seeds[i] = i * 31;
        }
    }
    public void add(String qq) {
        for (int seed : seeds) {
            int hash = hash(qq, seed);
            bitset.set(Math.abs(hash % size), true);
        }
    }
    public boolean contains(String qq) {
        for (int seed : seeds) {
            int hash = hash(qq, seed);
            if (!bitset.get(Math.abs(hash % size))) {
                return false;
            }
        }
        return true;
    }
    private int hash(String value, int seed) {
        int result = 0;
        for (char c : value.toCharArray()) {
            result = seed * result + c;
        }
        return result;
    }
}

3️⃣ Disk‑Based Solution: External Sort & Multi‑Way Merge

Processing Flow

Key Implementation

public void externalSort(String input, String output) throws IOException {
    List<File> chunks = splitAndSort(input, 100_000_000); // each chunk ~10 M lines
    mergeFiles(chunks, output);
}

void mergeFiles(List<File> files, String output) {
    PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
    List<BufferedReader> readers = new ArrayList<>();
    for (File file : files) {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        readers.add(reader);
        String line = reader.readLine();
        if (line != null) {
            queue.add(new MergeEntry(line, reader));
        }
    }
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
        long last = -1;
        while (!queue.isEmpty()) {
            MergeEntry entry = queue.poll();
            long qq = Long.parseLong(entry.value);
            if (qq != last) {
                writer.write(entry.value);
                writer.newLine();
                last = qq;
            }
            String next = entry.reader.readLine();
            if (next != null) {
                queue.add(new MergeEntry(next, entry.reader));
            }
        }
    } finally {
        readers.forEach(r -> { try { r.close(); } catch (IOException e) {} });
    }
}

class MergeEntry implements Comparable<MergeEntry> {
    String value;
    BufferedReader reader;
    MergeEntry(String value, BufferedReader reader) { this.value = value; this.reader = reader; }
    @Override public int compareTo(MergeEntry o) { return this.value.compareTo(o.value); }
}

4️⃣ Distributed Solution with Spark

val qqRDD = spark.read.textFile("hdfs://qq_data/*.txt")
    .map(_.toLong)
    .repartition(1000) // 1000 partitions

val distinctRDD = qqRDD.mapPartitions { iter =>
    val bitmap = new RoaringBitmap()
    iter.foreach(qq => bitmap.add(qq.toInt))
    bitmap.iterator.asScala.map(_.toLong)
}

val globalDistinct = distinctRDD.distinct()
globalDistinct.saveAsTextFile("hdfs://result/")

5️⃣ Real‑Time Deduplication with Redis

public class QQDeduplication {
    private static final String REDIS_KEY = "qq_set";
    public boolean isDuplicate(String qq) {
        try (Jedis jedis = jedisPool.getResource()) {
            // If cardinality exceeds 1 billion, treat as duplicate directly
            if (jedis.pfcount(REDIS_KEY) > 1_000_000_000L) {
                return true;
            }
            return jedis.sadd(REDIS_KEY, qq) == 0; // 0 means already existed
        }
    }
}

6️⃣ Layered Bitmap Index for Ultra‑Large Scale

public class LayeredBitmap {
    private final RoaringBitmap[][][] bitmaps;
    private static final int L1 = 100; // first‑level shards
    private static final int L2 = 100; // second‑level shards
    public LayeredBitmap() {
        bitmaps = new RoaringBitmap[L1][L2][];
    }
    public void add(long qq) {
        int l1Index = (int)((qq - 10000L) / 100_000_000);
        long remainder = (qq - 10000L) % 100_000_000;
        int l2Index = (int)(remainder / 1_000_000);
        int value = (int)(remainder % 1_000_000);
        if (bitmaps[l1Index][l2Index] == null) {
            bitmaps[l1Index][l2Index] = new RoaringBitmap();
        }
        bitmaps[l1Index][l2Index].add(value);
    }
    public boolean contains(long qq) {
        int l1Index = (int)((qq - 10000L) / 100_000_000);
        long remainder = (qq - 10000L) % 100_000_000;
        int l2Index = (int)(remainder / 1_000_000);
        int value = (int)(remainder % 1_000_000);
        RoaringBitmap bitmap = bitmaps[l1Index][l2Index];
        return bitmap != null && bitmap.contains(value);
    }
}

7️⃣ Comparison of Schemes

Single‑machine bitmap: Memory ≈ 1.16 GB, Accuracy 100 %, O(1) operations.

Bloom filter (0.1 % false‑positive): Memory ≈ 171 MB, O(k) operations, Accuracy ≈ 99.9 %.

External sort: disk‑based, O(n log n) time, 100 % accuracy.

Spark distributed: cluster storage, O(n) time, 100 % accuracy.

Redis real‑time: O(1) time, 100 % accuracy for incremental data.

Layered bitmap index: O(n) compressed storage, O(1) time, 100 % accuracy for ultra‑large scale.

8️⃣ Practical Tips & Pitfalls

Data Skew Mitigation

int getShardId(long qq, int shardCount) {
    String str = String.valueOf(qq);
    int suffix = Integer.parseInt(str.substring(Math.max(0, str.length() - 6)));
    return suffix % shardCount;
}

Accuracy Assurance

Cost Optimization

Cold‑hot separation: hot data in memory bitmap, cold data on disk.

Compressed storage: replace plain bitmap with RoaringBitmap.

Tiered storage: recent three months in memory, older data in HBase with compression.

Conclusion

The deduplication of 1 billion QQ numbers can be broken down into manageable sub‑problems using divide‑and‑conquer, bitmap indexing for O(1) lookups, Bloom filters for space‑efficient probabilistic checks, external sorting for disk‑based processing, and layered designs for ultra‑large scale precision.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsBitmapdeduplicationbloom-filterSpark
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.