Big Data 10 min read

Using Bitmap and Bloom Filter to De‑duplicate 4 Billion IDs Within 1 GB Memory

The article explains how to store and de‑duplicate 4 billion unsigned integers using a bitmap to reduce memory from 14.9 GB to under 500 MB, introduces the concept and benefits of bitmaps, describes Bloom filters, their principles, advantages, limitations, typical use cases, and provides Java and Redis implementation examples.

Architect's Guide

Jul 24, 2023

Using Bitmap and Bloom Filter to De‑duplicate 4 Billion IDs Within 1 GB Memory

Storing 4 billion unsigned integers directly would require about 4*4000000000 /1024/1024/1024 = 14.9G of memory, which far exceeds a 1 GB limit.

By using a bitmap, each number occupies only one bit, reducing the required space to 4000000000 * 1 /8 /1024/1024 = 476M, a dramatic saving.

To insert a QQ number such as "907607222" into the bitmap, locate the bit at that index and set it to 1; duplicate numbers will simply set the same bit again.

After all numbers are stored, any bit set to 1 indicates the presence of that QQ number, and scanning the bitmap yields the unique set.

What Is a Bitmap and Its Uses?

A bitmap (bit array) uses a single bit to represent the existence of an element, allowing massive space savings compared to storing full integers.

For example, representing the numbers 1, 4, and 6 with a bitmap requires only three bits, whereas using three unsigned ints would need 12 bytes (96 bits).

The primary advantage of a bitmap is its ability to save memory, making it ideal for de‑duplication, sorting, and as the foundation of probabilistic structures like Bloom filters.

What Is a Bloom Filter and How Does It Work?

A Bloom filter is a probabilistic data structure that quickly tests whether an element may exist in a set using a bit array and multiple hash functions.

When adding an element, each hash function maps it to several bit positions, which are set to 1. To query, the same hash functions are applied; if all corresponding bits are 1, the element is possibly present, otherwise it is definitely absent.

Bloom filters guarantee no false negatives but may produce false positives due to hash collisions.

The filter is initialized by specifying the expected number of elements and an acceptable false‑positive rate, after which bits are allocated and hash functions are chosen.

Operations include initialization, adding elements, and querying membership, each involving setting or checking bits derived from hash functions.

Advantages: fast membership checks and low memory usage. Drawbacks: false positives and difficulty deleting elements because bits may be shared among many items.

Typical applications include web crawlers (avoiding re‑crawling), cache systems (preventing cache penetration), distributed systems (reducing network traffic), spam filtering, and blacklist checks.

How to Use Bloom Filters in Java and Redis

Java libraries such as Google Guava and Apache Commons provide Bloom filter implementations.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterExample {
    public static void main(String[] args) {
        // Create a Bloom filter for 100 expected elements with 1% false‑positive rate
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(), 100, 0.01);
        // Insert elements
        bloomFilter.put("Lynn");
        bloomFilter.put("666");
        bloomFilter.put("八股文");
        // Test membership
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("张三")); // false
    }
}

Apache Commons example:

import org.apache.commons.lang3.StringUtils;
import org.apache.commons.collections4.BloomFilter;
import org.apache.commons.collections4.functors.HashFunctionIdentity;
public class BloomFilterExample {
    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = new BloomFilter<>(HashFunctionIdentity.hashFunction(StringUtils::hashCode), 100, 0.01);
        bloomFilter.put("Lynn");
        bloomFilter.put("666");
        bloomFilter.put("八股文");
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("张三")); // false
    }
}

Redis can use the Bloom module via Redisson:

Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> bloomFilter = redisson.getBloomFilter("myfilter");
bloomFilter.tryInit(100, 0.01);
bloomFilter.add("Lynn");
bloomFilter.add("666");
bloomFilter.add("八股文");
System.out.println(bloomFilter.contains("Lynn"));
System.out.println(bloomFilter.contains("张三"));
redisson.shutdown();

Or using Jedis:

Jedis jedis = new Jedis("localhost");
jedis.bfCreate("myfilter", 100, 0.01);
jedis.bfAdd("myfilter", "Lynn");
jedis.bfAdd("myfilter", "666");
jedis.bfAdd("myfilter", "八股文");
System.out.println(jedis.bfExists("myfilter", "Lynn")); // true
System.out.println(jedis.bfExists("myfilter", "张三")); // false
jedis.close();

These examples demonstrate how bitmaps and Bloom filters can efficiently handle massive datasets such as billions of QQ numbers while staying within limited memory constraints.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Bitmap deduplication Data Structures bloom-filter

Written by

Architect's Guide

Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.