How BitMap Storage Boosts Event Analysis Performance in Big Data Platforms
This article explains GrowingIO's event analysis data model, the challenges of metric‑dimension calculations on massive datasets, and how a BitMap‑based vertical storage and dimension‑combination numbering dramatically improve query efficiency and scalability.
1. From a Data Requirement
GrowingIO processes billions of user‑behavior events daily. The "event analysis" module lets users flexibly combine dimensions to view metrics, but raw SQL queries become slow as data grows.
2. Data Modeling Challenges
Metrics (e.g., user count, page views) and dimensions (time, city, browser, etc.) must be combined efficiently. Traditional horizontal storage and simple caching struggle with large‑scale, ad‑hoc queries.
3. Solutions Overview
Data Layering: Pre‑compute frequent query results into offline tables (e.g., "last 7 days‑region‑device metric table").
Data Pre‑aggregation: Use materialized views, cubes, or segments to trade space for speed.
These approaches reduce compute load but still generate many one‑time tables.
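The trade-off described above can be illustrated with a minimal sketch. It assumes events are plain dictionaries with hypothetical `user_id`, `region`, and `device` fields (none of these names come from the original system) and shows both a single pre-aggregated table and how quickly the number of possible tables grows with the number of dimensions.

```python
from collections import defaultdict
from itertools import combinations

def preaggregate(events, dims):
    """Build one rollup table: distinct users per value combination of dims."""
    table = defaultdict(set)
    for event in events:
        key = tuple(event[d] for d in dims)
        table[key].add(event["user_id"])
    return {key: len(users) for key, users in table.items()}

def all_rollups(dimensions):
    """Every non-empty subset of dimensions that could need its own table."""
    return [c for r in range(1, len(dimensions) + 1)
            for c in combinations(dimensions, r)]

# Hypothetical sample events; field names are assumptions for illustration.
events = [
    {"user_id": 1, "region": "Beijing", "device": "Mac"},
    {"user_id": 2, "region": "Beijing", "device": "Windows"},
    {"user_id": 2, "region": "Beijing", "device": "Windows"},
]
table = preaggregate(events, ("region", "device"))
rollups = all_rollups(["time", "region", "device", "browser"])
print(table)         # {('Beijing', 'Mac'): 1, ('Beijing', 'Windows'): 1}
print(len(rollups))  # 15
```

With n dimensions there are 2^n - 1 non-empty dimension subsets, which is one way to see why pre-computed tables multiply under ad-hoc querying.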
4. Optimized Storage Model Based on BitMap
To enable flexible dimension‑metric combinations, dimensions are stored vertically and separated from metrics.
4.1 Vertical Dimension Storage (User Count)
Each dimension value is stored in its own column, allowing arbitrary combination queries such as "region: Beijing" and "device: Mac".
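As a rough sketch of this idea, the snippet below assumes each dimension value maps to a bitmap of user IDs and that a combination query is a bitwise AND of those bitmaps. A Python integer stands in for the bitmap (bit i set means user i is present); a production system would more likely use a compressed bitmap library, and the function names here are hypothetical.

```python
from collections import defaultdict

def build_dimension_bitmaps(rows):
    """rows: iterable of (user_id, dimension, value) tuples."""
    index = defaultdict(int)
    for user_id, dimension, value in rows:
        # Set bit `user_id` in the bitmap for this dimension value.
        index[(dimension, value)] |= 1 << user_id
    return dict(index)

def users_in(bitmap):
    """Decode an integer bitmap into a sorted list of user IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample data: one region and one device per user.
rows = [
    (1, "region", "Beijing"), (1, "device", "Mac"),
    (2, "region", "Beijing"), (2, "device", "Windows"),
    (3, "region", "Shanghai"), (3, "device", "Mac"),
]
index = build_dimension_bitmaps(rows)

# "region: Beijing" AND "device: Mac" becomes a single bitwise AND.
both = index[("region", "Beijing")] & index[("device", "Mac")]
print(users_in(both))        # [1]
print(bin(both).count("1"))  # 1 (the user count for this combination)
```

Because every dimension value has its own bitmap, any combination of filters can be answered at query time without a table prepared for that specific combination.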
4.2 Metric Storage (Visit Count)
Metrics are stored as a Map<Int, BitMap> where the key is the count and the value is the set of users with that count.
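The following sketch shows one way such a count-keyed map could be built and queried. It again uses Python integers as stand-in bitmaps; `build_count_map` and `total_visits` are hypothetical helper names, and the optional `user_filter` is assumed to be a dimension bitmap like "region: Beijing".

```python
def popcount(bitmap):
    """Number of set bits, i.e. number of users in the bitmap."""
    return bin(bitmap).count("1")

def build_count_map(visits_per_user):
    """Map each visit count to the bitmap of users having exactly that count."""
    count_map = {}
    for user_id, count in visits_per_user.items():
        count_map[count] = count_map.get(count, 0) | (1 << user_id)
    return count_map

def total_visits(count_map, user_filter=None):
    """Sum of count * (number of users with that count), optionally filtered."""
    total = 0
    for count, bitmap in count_map.items():
        if user_filter is not None:
            bitmap &= user_filter  # keep only users matching the dimension
        total += count * popcount(bitmap)
    return total

# Hypothetical data: user_id -> visit count.
visits = {1: 3, 2: 3, 3: 5, 4: 1}
count_map = build_count_map(visits)
beijing = (1 << 1) | (1 << 3)  # assumed bitmap of Beijing users: 1 and 3

print(sorted(count_map))                 # [1, 3, 5]
print(total_visits(count_map))           # 12
print(total_visits(count_map, beijing))  # 8
```

Note that the map needs one key per distinct count, so a metric with widely varying values produces many keys.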
4.3 Binary Key Optimization
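Instead of using each decimal count as a key, the count is written in binary and one BitMap is kept per bit position: a user belongs to the BitMap for bit i if bit i of that user's count is set. A count of 5 (binary 101), for example, places the user in the BitMaps for bits 0 and 2. The number of keys then grows with the logarithm of the largest count rather than with the number of distinct counts, which dramatically reduces the number of keys.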
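A minimal sketch of the binary-key layout follows, assuming one bitmap per bit position and Python integers as stand-in bitmaps. The helper names are hypothetical. The total is recovered by weighting each bit position's user count by 2^bit.

```python
def popcount(bitmap):
    """Number of set bits, i.e. number of users in the bitmap."""
    return bin(bitmap).count("1")

def build_bit_slices(visits_per_user):
    """Map bit position -> bitmap of users whose count has that bit set."""
    slices = {}
    for user_id, count in visits_per_user.items():
        bit = 0
        while count:
            if count & 1:
                slices[bit] = slices.get(bit, 0) | (1 << user_id)
            count >>= 1
            bit += 1
    return slices

def total_visits(slices, user_filter=None):
    """Sum over bit positions of 2**bit * (users with that bit set)."""
    total = 0
    for bit, bitmap in slices.items():
        if user_filter is not None:
            bitmap &= user_filter
        total += (1 << bit) * popcount(bitmap)
    return total

# Hypothetical data: user u has exactly u visits, so 1000 distinct counts.
visits = {user_id: user_id for user_id in range(1, 1001)}
slices = build_bit_slices(visits)
first_three = (1 << 1) | (1 << 2) | (1 << 3)  # users 1, 2, 3

print(len(set(visits.values())))          # 1000 keys with decimal counts
print(len(slices))                        # 10 keys with binary bit positions
print(total_visits(slices))               # 500500
print(total_visits(slices, first_three))  # 6
```

In this example the decimal-keyed map would need 1000 entries, while the bit-position layout needs 10, at the cost of touching several bitmaps to reconstruct a single user's count.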
5. Handling Multi‑Dimension Combinations
When a user has multiple dimension values (e.g., Beijing + Windows), separating dimensions loses the combination relationship. To fix this, each unique dimension combination is assigned a sequential ID, and storage becomes Map<Short, BitMap> where the key is the combination ID.
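The sketch below illustrates both the problem and the fix under simple assumptions: two dimensions (region, device), Python integers as stand-in bitmaps, and sequential IDs handed out in order of first appearance. The function names and sample data are hypothetical.

```python
def users_in(bitmap):
    """Decode an integer bitmap into a sorted list of user IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def build_combo_index(events):
    """events: iterable of (user_id, region, device) tuples."""
    combo_ids = {}  # (region, device) -> sequential combination ID
    bitmaps = {}    # combination ID -> bitmap of users seen with it
    for user_id, region, device in events:
        combo_id = combo_ids.setdefault((region, device), len(combo_ids))
        bitmaps[combo_id] = bitmaps.get(combo_id, 0) | (1 << user_id)
    return combo_ids, bitmaps

def query(combo_ids, bitmaps, region=None, device=None):
    """OR together the bitmaps of every combination matching the filters."""
    result = 0
    for (r, d), combo_id in combo_ids.items():
        if (region is None or r == region) and (device is None or d == device):
            result |= bitmaps[combo_id]
    return result

# User 1 appears as Beijing+Windows and Shanghai+Mac, never Beijing+Mac.
events = [
    (1, "Beijing", "Windows"),
    (1, "Shanghai", "Mac"),
    (2, "Beijing", "Mac"),
]

# Separated dimensions: the AND wrongly includes user 1.
beijing = (1 << 1) | (1 << 2)
mac = (1 << 1) | (1 << 2)
print(users_in(beijing & mac))  # [1, 2]

# Combination IDs keep the relationship intact.
combo_ids, bitmaps = build_combo_index(events)
print(users_in(query(combo_ids, bitmaps, region="Beijing", device="Mac")))  # [2]
print(users_in(query(combo_ids, bitmaps, region="Beijing")))                # [1, 2]
```

A single-dimension filter is still answerable: it becomes an OR over all combination IDs that contain the requested value.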
6. Performance Comparison
Benchmarks compare SparkSQL (run with local[16] and 4 GB RAM) against single‑threaded BitMap computation. SparkSQL time grows linearly with data size, while BitMap time remains relatively stable.
7. Conclusion
BitMap provides both storage compression and fast set operations, making it suitable for large‑scale ID collections and enabling use cases such as cohort calculation, tagging, funnel analysis, retention, and user outreach without re‑computing groups.
Future work includes handling string‑based IDs, improving distributed BitMap performance, addressing high‑cardinality dimensions, and designing a SQL‑like language for this model.
GrowingIO Tech Team
The official technical account of GrowingIO, showcasing our tech innovations, experience summaries, and cutting‑edge technology.
