Fundamentals 12 min read

Understanding Bloom Filter: Concept, Principles, Implementation, and Applications

This article explains the concept, principles, implementation details, and practical applications of Bloom Filters, including formulas for optimal bit array size and hash count, Java code examples using Guava, and common use cases such as deduplication, web crawling, and spam filtering.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Bloom Filter: Concept, Principles, Implementation, and Applications

Algorithm Background

If you need to determine whether an element belongs to a set, a common approach is to store all elements and compare. Data structures like linked lists, trees, and hash tables follow this idea, trading time for space or vice versa. When the dataset grows, storage and lookup become costly.

In scenarios with strict response time requirements, increasing elements leads to larger memory consumption and slower retrieval. The challenge is to find a data structure that offers both low time and space overhead. Bloom Filter provides such a solution.

Bloom Filter Concept

A Bloom Filter, proposed by Bloom in 1970, is essentially a long binary vector combined with multiple random hash functions. It can test whether an element is in a set with high space efficiency and fast query time, at the cost of a certain false‑positive rate and difficulty in deletion.

Bloom Filter Principle

When an element is added, K hash functions map it to K positions in a bit array and set those bits to 1. To query, the same K positions are checked: if any bit is 0 the element is definitely absent; if all are 1 the element is probably present.

Unlike a single‑hash bitmap, a Bloom Filter uses K independent hash functions, reducing collision probability.

Bloom Filter Drawbacks

Bloom Filters sacrifice accuracy and deletion convenience for efficiency.

False positives may occur; a queried element might not be in the set even though all K bits are 1. A whitelist can mitigate this.

Deletion is hard because clearing bits may affect other elements. Counting Bloom Filters address this issue.

Bloom Filter Implementation

Guava provides a ready‑made Bloom Filter implementation. Two essential parameters are the expected number of items n and the desired false‑positive probability fpp. From these, the bit array size m and the number of hash functions k are derived.

(1) Bit Array Size Selection

The optimal size m is calculated as:

(2) Hash Function Selection

The optimal number of hash functions k is:

Choosing good hash functions greatly impacts performance. A common technique is to use a single high‑quality hash (e.g., MurmurHash3) with different seeds to simulate multiple functions.

Code Example: Optimal Number of Bits

/**
 * 计算 Bloom Filter的bit位数m
 *
 * @param n 预期数据量
 * @param p 误判率 (must be 0 < p < 1)
 */
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
  if (p == 0) {
    p = Double.MIN_VALUE;
  }
  return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

Code Example: Optimal Number of Hash Functions

/**
 * 计算最佳k值,即在Bloom过滤器中插入的每个元素的哈希数
 *
 * @param n 预期数据量
 * @param m bloom filter中总的bit位数 (must be positive)
 */
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
  // (m / n) * log(2), but avoid truncation due to division!
  return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

Putting data into a Bloom Filter uses MurmurHash3 to obtain a 128‑bit hash, splits it into two 64‑bit values, and combines them to set the appropriate bits.

Code Example: Put Method

public <T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
  long bitSize = bits.bitSize();
  // 利用MurmurHash3得到数据的hash值对应的字节数组
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
  // 取低8个字节、高8个字节,转成long类型
  long hash1 = lowerEight(bytes);
  long hash2 = upperEight(bytes);
  boolean bitsChanged = false;
  long combinedHash = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
    combinedHash += hash2;
  }
  return bitsChanged;
}

The corresponding mightContain method follows the same hashing logic for membership queries.

Bloom Filter Usage

A simple demo shows how to create and query a Bloom Filter using Guava, similar to using a HashMap.

package com.qunar.sage.wang.common.bloom.filter;

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.Funnels;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.ToString;

public class BloomFilterTest {
    public static void main(String[] args) {
        long expectedInsertions = 10000000;
        double fpp = 0.00001;
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions, fpp);
        bloomFilter.put("aaa");
        bloomFilter.put("bbb");
        boolean containsString = bloomFilter.mightContain("aaa");
        System.out.println(containsString);

        BloomFilter<Email> emailBloomFilter = BloomFilter.create((Funnel<Email>) (from, into) -> into.putString(from.getDomain(), Charsets.UTF_8), expectedInsertions, fpp);
        emailBloomFilter.put(new Email("sage.wang", "quanr.com"));
        boolean containsEmail = emailBloomFilter.mightContain(new Email("sage.wangaaa", "quanr.com"));
        System.out.println(containsEmail);
    }

    @Data
    @Builder
    @ToString
    @AllArgsConstructor
    public static class Email {
        private String userName;
        private String domain;
    }
}

Bloom Filter Applications

Monitoring systems can use Bloom Filters to deduplicate metric names before storing them.

Crawlers filter already‑visited URLs to avoid re‑crawling.

Spam email filtering: a Bloom Filter can represent billions of addresses using a fraction of the memory required by a traditional hash table.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaBig DataGuavabloom-filterhash functionsprobabilistic data structure
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.