Big Data 12 min read

Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications

The article explains how space‑efficient probabilistic structures such as BloomFilter and Count‑min Sketch enable large‑scale data deduplication, join pruning, real‑time idempotent filtering, and approximate top‑K analytics by trading modest accuracy loss for dramatically reduced storage and faster computation.

NetEase Yanxuan Technology Product Team
NetEase Yanxuan Technology Product Team
NetEase Yanxuan Technology Product Team
Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications

As data volumes grow to billions or trillions of records, efficiently obtaining statistical results and deduplicated aggregate metrics with fewer resources and less time has become a frequent business requirement. Often, obtaining exact numbers is unnecessary; instead, understanding overall data trends or macro descriptive statistical features across dimensions is sufficient for decision-making and business effect evaluation. This has made research into probability algorithms a热门研究方向.

This article introduces two effective algorithms that evolved from HASH algorithms, discussing their applications in big data development:

1. BloomFilter

BloomFilter is a space-efficient probabilistic data structure based on HASH algorithms. It can represent large datasets using minimal storage by compressing data through multiple HASH functions.

Key characteristics:

If an element is not in the set, it definitely is not in the set (no false negatives)

If an element is in the set, it may or may not be in the set (possible false positives)

Compressed data is difficult to restore to original form

Strong compression ratio - millions of records can be represented in just KB or MB

Supports aggregation operations across different dimensions

Applications:

Deduplication metrics calculation: In offline scenarios, first map target objects to BF structure, then use BF characteristics to calculate deduplication across different dimensions. In Hive 2.7+, both standard BloomFilter and BloomKFilter implementations are available.

JOIN optimization: For scenarios like A LEFT JOIN B where both tables are large, use BF to filter A's JOIN keys that exist in B, reducing unnecessary network shuffle operations.

Real-time scenarios: Combined with distributed caching systems (or HBase) for metric calculation and message idempotent filtering.

2. Count-min Sketch

Count-min Sketch is a frequency estimation algorithm derived from BloomFilter. It uses multiple HASH functions to map input data to an array, recording the count of each hash collision. The algorithm returns the minimum value among all HASH results to reduce overestimation errors.

Key characteristics:

Minimal storage independent of data volume - tens of thousands of records require only tens of KB

Non-exact probabilistic calculation

Frequency results tend to be overestimated

Fast computation speed

Applications:

TOP-K ranking with acceptable error margins in both offline and real-time scenarios

Leaderboards, topic rankings, and complaint statistics

These algorithms trade some accuracy for significant improvements in storage efficiency and computational speed, making them valuable for large-scale data processing scenarios where approximate results are acceptable.

Big DatahashJoin OptimizationBloomFilterCount-Min Sketchdata deduplicationprobability algorithmSketch Algorithm
NetEase Yanxuan Technology Product Team
Written by

NetEase Yanxuan Technology Product Team

The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.