Cardinality Counting in Presto: Algorithms, Implementation, and Best Practices
The article explains cardinality counting in Presto, comparing exact set‑based methods with memory‑efficient bitmap, Linear Count, and HyperLogLog approximations, detailing their algorithms, implementation in Presto’s query engine, and offering best‑practice recommendations for choosing the appropriate technique in business workloads.
This article, authored by the Vivo Internet User Operations Development Team, introduces the basic principles of statistical counting and the implementation ideas in Presto. It covers exact and approximate counting, their advantages and disadvantages, and provides practical recommendations for using statistical counting in business scenarios.
Background
Counting is a fundamental operation in Hadoop (e.g., the classic word‑count example) and is essential for BI, marketing analytics, and pagination modules that need total record counts. While simple counting (e.g., wc on Linux) is trivial, counting distinct elements (cardinality counting) such as PV/UV is more challenging, especially in distributed SQL engines like Presto.
2. Cardinality Counting Algorithms
In SQL, cardinality counting corresponds to COUNT(DISTINCT field) or approx_distinct() . Exact counting typically uses a Set data structure, which can be implemented as a hash‑based or tree‑based set. Both suffer from linear memory growth with data size.
Two main strategies address the memory issue:
Replace the set with a more compact structure, such as a bitmap.
Accept approximation and use probabilistic algorithms like Linear Count or HyperLogLog.
2.1 Bitmap
Bitmap indexes use a bit vector to record the presence of a value in each row. They require a permanent, sequential identifier (e.g., an auto‑increment primary key). The article illustrates bitmap construction for a sample user table, showing how to build bit vectors for the age field (values 30, 40, 50) and the name field.
Example of bitmap construction steps (S1‑S3) is shown with accompanying images (omitted here for brevity).
Using the bitmap, a query such as select count(1) from user where age=40; can be answered by counting the number of 1s in the corresponding bit vector. For multi‑condition queries, intersect the relevant bit vectors (e.g., age=40 and name='baz') and count the resulting ones.
The advantages of bitmap include low memory consumption, fast bit‑wise operations, and the possibility of compression. However, non‑numeric columns need additional transformation.
2.2 Linear Count Algorithm
Linear Count (LC) converts the cardinality problem into a mathematical estimation:
Initialize m empty “rooms”.
Hash each element into one of the rooms.
Count the number of empty rooms U.
Estimate n = -m * log(U/m).The accuracy depends on the choice of m and the variance of the estimator.
2.3 HyperLogLog Algorithm
HyperLogLog (HLL) provides near‑optimal cardinality estimation with very small memory (e.g., 12 KB in Redis can estimate up to 2^64 distinct items). It relies on Bernoulli trials and harmonic averaging. The article presents a Java implementation of a coin‑toss simulation to illustrate the underlying probabilistic experiment:
private Random random = new Random();
/**
* 0 represents heads, 1 represents tails
*/
public int tossCoin() {
int r, cnt = 0;
do {
r = random.nextInt(2);
cnt++;
} while (r < 1);
return cnt;
}Experimental results for 10, 100, and 1000 toss rounds are shown, demonstrating the distribution of trial lengths.
The algorithm maps each element to a group via a hash function (e.g., MurmurHash3). The leading bits determine the group, while the remaining bits are used to compute the number of leading zeros (the “p” value). The maximum p per group is recorded, and the final estimate is derived from a standard HLL formula.
3. Distributed Counting Core Process
The article outlines the typical MapReduce flow for counting and then shows Presto’s execution pipeline, which also follows a “group‑aggregate‑merge” pattern but with distributed hash‑based shuffling.
4. Cardinality Counting in Presto
Presto supports two implementations:
COUNT(DISTINCT …) – exact counting using hash distribution and partial aggregation.
approx_distinct(…) – approximate counting based on HyperLogLog (with an internal fallback to Linear Count for very high cardinalities). The underlying library (airlift) uses MurmurHash3 to generate 64‑bit hashes, the first 6 bits for group selection, and the rest for HLL computation.
5. Business Recommendations
High‑cardinality counting is memory‑intensive and generates significant network traffic in distributed systems. When exact counts are not required, using approximate methods (e.g., HLL) can dramatically reduce resource consumption and improve latency. For fields with moderate cardinality, bitmap indexes are a viable alternative.
In summary, understanding the trade‑offs between exact and approximate counting enables more efficient system design, especially when choosing between bitmap indexes, Linear Count, and HyperLogLog for different data distributions.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.