
Understanding HyperLogLog: Algorithm Principles, Redis Implementation, and Experimental Analysis

This article explores the HyperLogLog algorithm for cardinality estimation, tracing its development from Linear and LogLog counting, detailing its Redis implementation with sparse and dense encodings and command workflows, and presenting experiments that demonstrate its memory efficiency and analyze observed error rates versus the theoretical 0.81% standard deviation.

Beike Product & Technology

The article begins with a real‑world scenario of deduplicating tens of millions of user IDs, noting that conventional structures like Set and HashMap consume memory that grows linearly with the number of distinct elements, which motivated the choice of HyperLogLog for its constant‑size footprint.

It then defines HyperLogLog as a probabilistic algorithm for estimating the cardinality of a large set with a standard error of about 0.81%, emphasizing that it requires only ~12 KB of memory regardless of data volume.
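Both figures follow directly from Redis's choice of 14 index bits. A quick check of the arithmetic (the 1.04/√m error formula is the standard one from the HyperLogLog paper; the 6‑bit register width is Redis's dense encoding):

```python
import math

# Redis uses p = 14 index bits, so m = 2**14 = 16384 registers.
p = 14
m = 2 ** p

# Relative standard error of the HyperLogLog estimate: 1.04 / sqrt(m).
std_error = 1.04 / math.sqrt(m)
print(f"standard error: {std_error:.4%}")   # ~0.81%

# Dense encoding stores one 6-bit counter per register.
memory_bytes = m * 6 // 8
print(f"register memory: {memory_bytes} bytes")  # 12288 bytes = 12 KB
```

This is why the footprint is fixed: the register array never grows, no matter how many elements are added.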

The discussion proceeds through the historical evolution of cardinality estimation: first the Linear Counting algorithm proposed by Whang et al. in 1990, then the LogLog Counting algorithm that improves space usage by recording only the position of the first 1 bit in each hashed value, and finally the HyperLogLog algorithm introduced by Flajolet et al., which reduces variance by using the harmonic mean of bucket estimates and falls back to Linear Counting for small cardinalities.
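The estimation step described above can be sketched in a few lines. This is a simplified version: the bias‑correction constant uses the standard large‑m approximation from the HyperLogLog paper, and Redis layers additional empirical corrections on top that are omitted here.

```python
import math

def hll_estimate(registers):
    """Estimate cardinality from per-bucket maximum run-length counts.

    Simplified sketch: harmonic mean of 2**r per bucket, with the
    small-range Linear Counting fallback.  Redis applies further
    bias corrections not shown here.
    """
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)   # bias constant (large-m approximation)
    # Harmonic mean damps the influence of outlier buckets,
    # which is HyperLogLog's improvement over LogLog's arithmetic mean.
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small cardinalities: fall back to Linear Counting.
        return m * math.log(m / zeros)
    return raw
```

With all registers at zero the Linear Counting branch correctly reports a cardinality of zero.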

Next, the article details Redis’s HyperLogLog implementation. It describes the hllhdr structure (magic, encoding, notused, card, registers) and explains the two encoding modes: sparse encoding that uses ZERO, XZERO, and VAL opcodes to compress empty buckets, and dense encoding that stores 6‑bit counters in a contiguous 12 KB array, with bit‑level get/set helpers to handle cross‑byte access.
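The cross‑byte access the article mentions arises because 6‑bit counters do not align to byte boundaries. A sketch of the accessors, assuming the little‑endian bit layout of Redis's HLL_DENSE_GET_REGISTER / HLL_DENSE_SET_REGISTER macros (the exact masks here are illustrative, not copied from the source):

```python
def get_register(regs: bytearray, i: int) -> int:
    """Read 6-bit counter i from a packed byte array."""
    bit = i * 6
    byte, shift = bit // 8, bit % 8
    b0 = regs[byte]
    # A counter straddles two bytes whenever shift > 2 (shift + 6 > 8).
    b1 = regs[byte + 1] if shift > 2 else 0
    return ((b0 >> shift) | (b1 << (8 - shift))) & 0x3F

def set_register(regs: bytearray, i: int, val: int) -> None:
    """Write 6-bit counter i, preserving neighboring counters' bits."""
    bit = i * 6
    byte, shift = bit // 8, bit % 8
    # Clear then set the counter's bits within the first byte...
    regs[byte] = (regs[byte] & ~((0x3F << shift) & 0xFF)) | ((val << shift) & 0xFF)
    if shift > 2:
        # ...and the spillover bits in the next byte.
        rem = 6 - (8 - shift)
        regs[byte + 1] = (regs[byte + 1] & ~((1 << rem) - 1)) | (val >> (8 - shift))
```

A round trip over a small register file confirms neighboring counters do not clobber each other.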

The command workflows for PFADD, PFCOUNT, and PFMERGE are outlined. PFADD hashes the element with MurmurHash64A, locates the bucket, updates the register if the new count is larger, and invalidates the cached cardinality; PFCOUNT computes the estimate from the registers, applying the bias correction and, for small ranges, the linear‑counting correction; PFMERGE combines multiple keys by taking the per‑bucket maximum of their registers.
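The PFADD path can be sketched as follows. This mirrors the shape of Redis's hllPatLen(): the low 14 bits of the 64‑bit hash pick the bucket, and the position of the first set bit in the remaining 50 bits (bounded by a sentinel at bit 50) becomes the candidate count. SHA‑1 stands in for MurmurHash64A here purely to keep the sketch dependency‑free; it is not what Redis uses.

```python
import hashlib

P = 14              # index bits; Redis default
M = 1 << P          # 16384 buckets

def bucket_and_count(element: bytes):
    """Split a 64-bit hash into (bucket index, first-set-bit position)."""
    h = int.from_bytes(hashlib.sha1(element).digest()[:8], "little")
    index = h & (M - 1)            # low 14 bits pick the register
    rest = (h >> P) | (1 << 50)    # sentinel bit caps the count at 51
    count = 1
    while rest & 1 == 0:
        count += 1
        rest >>= 1
    return index, count

def pfadd(registers, element: bytes) -> bool:
    """Update the register only if the new count is larger; the return
    value mirrors PFADD's 'a register changed' flag -- which, as the
    article stresses, is NOT a reliable duplicate indicator."""
    index, count = bucket_and_count(element)
    if count > registers[index]:
        registers[index] = count
        return True
    return False
```

Adding the same element twice returns True then False, but an unseen element can also return False when its hash pattern does not beat the existing register value, which is exactly the misuse the experiments expose.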

Two sets of experiments are presented. The first compares memory consumption of Bitmap, HashMap, and HyperLogLog for 10, 10 000, and 10 000 000 distinct IDs, showing that HyperLogLog stays at ~12 KB while the others grow with the data. The second evaluates the actual error of HyperLogLog when used for deduplication via PFADD/PFCOUNT, revealing that treating PFADD's return value as a duplicate flag yields a high false‑positive rate, while the cardinality estimate from PFCOUNT remains accurate for large datasets, with an observed error of about 0.53%, within the theoretical 0.81% standard deviation.

The article concludes that HyperLogLog is a valuable tool for approximate cardinality estimation when memory is constrained, provided it is used correctly—i.e., relying on PFCOUNT for estimates rather than treating PFADD return values as duplicate flags. It also lists references covering the original papers, Redis source‑code analyses, and related blog posts.

Algorithm · HyperLogLog · Redis · Data Structures · Cardinality Estimation
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
