How BitMap Accelerates Active-Day Distribution Calculations in Big Data
BitMap, a space‑saving bit‑array structure, can replace costly I/O‑heavy Spark jobs for computing user active‑day distributions by converting joins and distinct operations into fast bitwise logic, enabling efficient 30‑day rolling metrics with minimal memory and superior performance, as demonstrated by real‑world benchmarks.
1. BitMap Introduction
BitMap stores data using a bit array, saving space and enabling efficient computation, making it suitable for large‑scale data processing where I/O overhead is a bottleneck. By representing data states with bits and applying compression such as RLE, billions of records can be loaded into memory at once, and operations like join and distinct become simple bitwise AND/OR/XOR/NOT, which is especially effective for UV calculations and multidimensional cross‑analysis.
2. Applying BitMap to Active‑Day Distribution
Traditional Spark code loads a month’s activity logs, counts each user’s active days, and aggregates the distribution, which is resource‑intensive. The article presents a BitMap‑based approach that replaces this pipeline with bitwise operations, allowing the same metric to be computed daily without reloading raw logs.
Example Spark Scala code (original implementation):
val activeData = sc.objectFile(inputHdfsPath) // each uid and its active dates
activeData.map{ case (uid, activeDate) => (uid, 1) }
.reduceByKey(_ + _)
.map{ case (uid, activeCount) => (activeCount, 1) }
.reduceByKey(_ + _)
.saveAsObjectFile(outputHdfsPath)The BitMap method encodes each uid to an index; a bit set to 1 indicates presence. Bitwise set operations (AND, OR, XOR, NOT) correspond to set intersections, unions, differences, and complements, enabling fast computation of active‑day distributions.
Definitions of set operations:
M∩N (AND) – users present in both BitMaps.
M∪N (OR) – users present in either BitMap.
M⊕N (XOR) – users present in M but not in N.
^N (NOT) – users not present in N.
For a 30‑day rolling window, BitMaps A_i (users active i days in April) are combined with daily BitMaps C (April 1) and D (May 1) to derive BitMaps S4, S5, S45, and subsequently the target distribution B_i using formulas such as: B1 = (A0 ∩ S5) ∪ (A1 ∩ S45) ∪ (A2 ∩ S4) and similarly for B2 … B30, where A0 = ¬(A1 ∪ A2 … ∪ A30) and A31 = Ø.
The construction of A from daily BitMaps E_i (i = 1…30) proceeds iteratively, updating A0…A_i with bitwise intersections and complements as each day's BitMap is loaded.
3. Performance Comparison
On a cluster of ten 16‑core CPUs with 64 GB RAM, processing tens of millions of monthly active users using the 30‑day BitMap method achieved the performance shown in the figure below.
4. References
2016.04 “Programmer” – iFlytek’s Spark‑based user retention analysis and implementation.
http://www.infoq.com/cn/articles/the-secret-of-bitmap/
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
iFlytek Mobile Internet Technology Team
Welcome! Follow iFlytek's Mobile Internet Technology Team for sharing technical practices, lessons learned, and behind-the-scenes stories of mobile internet product development.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
