Big Data 5 min read

Using Bitmap and Bloom Filter for Large‑Scale Data Deduplication in Java

The article explains how to store and deduplicate billions of identifiers by using a bitmap to represent presence with a single bit per value, calculates memory requirements, shows Redis bitmap commands, and introduces Bloom filters as an extension with multiple hash functions for efficient large‑scale data handling.

Java Architect Essentials
Java Architect Essentials
Java Architect Essentials
Using Bitmap and Bloom Filter for Large‑Scale Data Deduplication in Java

First, the article estimates the memory needed to store 4 billion QQ numbers using an unsigned integer (4 bytes each), which requires about 15 GB, noting that Java does not have an unsigned integer type.

Because such raw storage is inefficient, the article proposes using a bitmap (bit array) where each bit represents the existence of a specific identifier, enabling compact deduplication.

A bitmap works like a map that stores only 0 or 1; when the bit at a given index is 1, the corresponding QQ number exists.

To check whether a specific QQ number, e.g., 2924357571 , exists, one simply reads the bit at that index; a value of 1 indicates presence.

Since a bitmap uses one bit per possible value, the full 32‑bit range (2^32‑1) requires 2^32 bits ≈ 512 MB of memory, far less than the 15 GB required by plain integers.

In Java, bitmap operations are often performed via Redis, which provides three common bitmap commands (shown in the accompanying image).

The article also lists other practical uses of bitmaps: large‑scale deduplication (e.g., order numbers, URL crawling), online user counting, video view statistics, and as the foundation of Bloom filters.

A Bloom filter extends the bitmap concept by applying multiple hash functions; only when all corresponding bits are 1 does the filter claim the element exists, offering probabilistic membership testing.

The Hutool library implements a Bloom filter with a BitMapBloomFilter class that supplies five default hash functions. The following code shows its constructor:

public BitMapBloomFilter(int m) {
    int mNum = NumberUtil.div(String.valueOf(m), String.valueOf(5)).intValue();
    long size = mNum * 1024 * 1024 * 8;

    filters = new BloomFilter[]{
        new DefaultFilter(size),
        new ELFFilter(size),
        new JSFilter(size),
        new PJWFilter(size),
        new SDBMFilter(size)
    };
}

Advantages: unlike a plain bitmap, a Bloom filter can handle arbitrary strings or objects by hashing them into multiple bit positions.

Disadvantages: hash collisions can cause false positives, so the filter has an inherent error rate.

JavaBig DataRedisBitMapdeduplicationData StructuresBloom Filter
Java Architect Essentials
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.