Big Data 17 min read

How HyperLogLog Estimates Cardinality in Massive Data Sets

This article explains the cardinality‑counting problem behind DAU/MAU and unique visitor metrics, compares naïve solutions like Set, Bitmap and Bloom filter, introduces big‑data algorithms such as Linear Counting, LogLog and HyperLogLog, and shows how Redis implements HyperLogLog with dense and sparse storage optimizations.

vivo Internet Technology

Jan 6, 2021

How HyperLogLog Estimates Cardinality in Massive Data Sets

Problem Introduction

Counting daily active users (DAU), monthly active users (MAU) or unique visitors is a cardinality‑counting problem: estimating the number of distinct elements in a multiset.

Basic Exact Methods

Set : Store every element in a hash‑ or tree‑based set and rely on de‑duplication. Accurate but memory‑intensive (e.g., 100 M 4‑byte keys need ~400 MiB).

Bitmap : Allocate one bit per possible value. Memory depends on the value range; a bitmap for the range 1‒100 M requires ~12 MiB.

Bloom filter : Multiple hash functions map an element to several bits, introducing a controllable false‑positive rate.

These exact structures become impractical when the number of distinct elements reaches hundreds of millions or billions.

Approximate Cardinality Algorithms for Big Data

Probabilistic algorithms trade a small relative error for orders‑of‑magnitude lower memory consumption. The most widely used are:

Linear Counting (LC)

LogLog Counting (LLC)

HyperLogLog Counting (HLLC)

All produce an estimate rather than an exact count.

Bernoulli‑Trial Foundation

The core of LogLog and HyperLogLog is the analysis of a Bernoulli trial. For a fair coin (p = ½), let k be the number of flips needed to obtain the first head. For N independent trials, the probability that the maximum k among them does not exceed a value k is

and the probability that at least one trial exceeds k is

These formulas allow us to infer N from the observed maximum k .

LogLog Counting (LLC)

Each element is hashed to a 32‑bit value. The position of the first 1 bit when scanning from the least‑significant side is recorded as k . The hash space is divided into m buckets (e.g., 2¹⁰ = 1024 buckets using the lower 10 bits). For each bucket we keep the largest k observed and compute the arithmetic mean of all bucket maxima:

LLC stores only 5 bits per bucket, so 1024 × 5 bits ≈ 640 B of memory. Its standard error is about 1.3 %.

HyperLogLog Counting (HLLC)

HLLC replaces the arithmetic mean with the harmonic mean, which is less sensitive to outliers. The estimate is

HLLC also uses 5 bits per bucket, so memory consumption is identical to LLC, but the typical relative error drops to ~1 %.

Redis Implementation of HyperLogLog

Redis added native HyperLogLog support in version 2.8.9. The primary commands are: PFADD – add one or more elements to an HLL. PFCOUNT – return the estimated cardinality of an HLL. PFMERGE – merge multiple HLLs into a single one.

Redis hashes each element to a 64‑bit value. The lower 14 bits select one of 16 384 buckets; the upper 50 bits are used for the Bernoulli trial, requiring only 6 bits per bucket. Consequently the whole structure occupies a fixed 12 KB regardless of cardinality, with a standard error of ≈0.81 %.

Dense vs. Sparse Storage

Dense storage is a fixed 12 KB array where each bucket occupies 6 bits. Reads and writes are performed with bit‑wise operations to locate the bucket.

Sparse storage encodes the 16 384 buckets with a compact byte‑wise representation:

ZERO (1 byte): run of up to 64 consecutive zero‑valued buckets.

XZERO (2 bytes): run of up to 16 384 zero‑valued buckets.

VAL (1 byte): run of up to 32 identical non‑zero values together with the run length.

The structure starts as a single XZERO. As elements are added, a few ZERO, XZERO and VAL entries replace the initial XZERO. Redis switches from sparse to dense when a bucket’s value exceeds 32 or when the sparse representation grows beyond roughly 3 KB.

Key Takeaways

HyperLogLog provides a memory‑efficient way to estimate set cardinality in massive data streams, making it suitable for DAU/MAU, unique‑visitor counting, and similar metrics. Redis’s built‑in HLL implementation demonstrates a practical deployment with fixed‑size storage and automatic dense/sparse switching.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

algorithm Big Data HyperLogLog Redis probability Cardinality

Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.