How BitMap Storage Boosts Event Analysis Performance in Big Data Platforms

This article explains GrowingIO's event analysis data model, the challenges of metric‑dimension calculations on massive datasets, and how a BitMap‑based vertical storage and dimension‑combination numbering dramatically improve query efficiency and scalability.

GrowingIO Tech Team

Jul 9, 2020

1. From a Data Requirement

GrowingIO processes billions of user‑behavior events daily. The "event analysis" module lets users flexibly combine dimensions to view metrics, but raw SQL queries become slow as data grows.

2. Data Modeling Challenges

Metrics (e.g., user count, page views) and dimensions (time, city, browser, etc.) must be combined efficiently. Traditional horizontal storage and simple caching struggle with large‑scale, ad‑hoc queries.

3. Solutions Overview

Data Layering: Pre‑compute frequent query results into offline tables (e.g., "last 7 days‑region‑device metric table").

Data Pre‑aggregation: Use materialized views, cubes, or segments to trade space for speed.

These approaches reduce compute load, but ad‑hoc queries still generate many single‑use tables.

4. Optimized Storage Model Based on BitMap

To enable flexible dimension‑metric combinations, dimensions are stored vertically and separated from metrics.

4.1 Vertical Dimension Storage (User Count)

Each dimension value is stored in its own column, allowing arbitrary combination queries such as "region: Beijing" and "device: Mac".
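
A minimal sketch of this idea, assuming each dimension value maps to a BitMap of the users who have it. The article does not show the actual storage code; here a plain Python int stands in for the BitMap (bit i set means user i is present), and the names `build_dimension_index` and `user_count` are hypothetical.

```python
from collections import defaultdict

def build_dimension_index(rows):
    """Map (dimension, value) -> bitmap of user IDs (a Python int used as a bitset)."""
    index = defaultdict(int)
    for user_id, dimension, value in rows:
        index[(dimension, value)] |= 1 << user_id  # set the bit at position user_id
    return index

def user_count(bitmap):
    """Number of users in the bitmap (count of set bits)."""
    return bin(bitmap).count("1")

# Illustrative data only: (user_id, dimension, value)
rows = [
    (0, "region", "Beijing"), (0, "device", "Mac"),
    (1, "region", "Beijing"), (1, "device", "Windows"),
    (2, "region", "Shanghai"), (2, "device", "Mac"),
    (3, "region", "Beijing"), (3, "device", "Mac"),
]
index = build_dimension_index(rows)

# "region: Beijing" AND "device: Mac" is a bitwise intersection.
beijing_mac = index[("region", "Beijing")] & index[("device", "Mac")]
beijing_mac_users = user_count(beijing_mac)
```

Because every dimension value has its own bitmap, any combination of filters can be answered at query time with AND/OR operations, without a pre-built table per combination.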

4.2 Metric Storage (Visit Count)

Metrics are stored as a Map<Int, BitMap> where the key is the count and the value is the set of users with that count.
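
The count-keyed map can be sketched as follows. This is an illustration under assumptions, not GrowingIO's implementation: a Python dict plays the role of Map<Int, BitMap>, a Python int plays the role of the BitMap, and the helper names are hypothetical.

```python
from collections import defaultdict

def build_count_index(visits_per_user):
    """Map visit count -> bitmap of the users with exactly that count."""
    count_to_users = defaultdict(int)
    for user_id, visits in visits_per_user.items():
        count_to_users[visits] |= 1 << user_id
    return dict(count_to_users)

def popcount(bitmap):
    return bin(bitmap).count("1")

def total_visits(count_to_users, user_filter):
    """Sum visits over the users selected by user_filter (itself a bitmap)."""
    return sum(count * popcount(users & user_filter)
               for count, users in count_to_users.items())

# Illustrative data only: user_id -> visit count
visits_per_user = {0: 3, 1: 5, 2: 3, 3: 1}
count_index = build_count_index(visits_per_user)

total = total_visits(count_index, 0b1111)     # all four users
filtered = total_visits(count_index, 0b1001)  # only users 0 and 3
```

The filter bitmap would typically come from a dimension query, so the metric for any dimension combination is one intersection and one population count per key.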

4.3 Binary Key Optimization

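Instead of using each decimal count as a key, the count is written in binary and each bit position becomes a key: a user is added to the BitMap for bit i whenever bit i of that user's count is set. The number of keys then grows with the logarithm of the largest count rather than with the number of distinct counts, which dramatically reduces the number of keys.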
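
A sketch of the binary-key layout, assuming the key is the bit position and the total is recovered by weighting each BitMap's user count by 2^bit. As before, a Python int stands in for the BitMap and the function names are hypothetical.

```python
def build_bit_slices(visits_per_user):
    """Map bit position -> bitmap of users whose count has that bit set."""
    slices = {}
    for user_id, visits in visits_per_user.items():
        bit = 0
        while visits:
            if visits & 1:
                slices[bit] = slices.get(bit, 0) | (1 << user_id)
            visits >>= 1
            bit += 1
    return slices

def popcount(bitmap):
    return bin(bitmap).count("1")

def total_visits(slices, user_filter):
    """Total visits of the filtered users: sum of 2^bit * |slice AND filter|."""
    return sum((1 << bit) * popcount(users & user_filter)
               for bit, users in slices.items())

# Same illustrative data as a count-keyed map would hold: 3 = 0b011, 5 = 0b101, 1 = 0b001
slices = build_bit_slices({0: 3, 1: 5, 2: 3, 3: 1})
total = total_visits(slices, 0b1111)
filtered = total_visits(slices, 0b1001)  # only users 0 and 3

# 1000 users with 1000 distinct counts (1..1000) need only 10 keys, since 1000 < 2**10.
big_slices = build_bit_slices({user_id: user_id + 1 for user_id in range(1000)})
everyone = (1 << 1000) - 1
big_total = total_visits(big_slices, everyone)
```

With decimal keys the second data set would need 1000 map entries; with binary keys it needs 10, and the sum is still exact.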

5. Handling Multi‑Dimension Combinations

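When a user has several values for the same dimensions (e.g., one visit from Beijing on Windows and another from Shanghai on Mac), storing each dimension separately loses which values occurred together: intersecting the Beijing and Mac BitMaps would wrongly match this user. To fix this, each unique dimension combination is assigned a sequential ID, and storage becomes Map<Short, BitMap> where the key is the combination ID.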
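
The numbering scheme can be sketched like this. It is an assumption-laden illustration: the article only states that combinations get sequential IDs, so the dict-based ID assignment, the two-dimension tuple, and the helper names are all hypothetical, and a Python int again stands in for the BitMap.

```python
def build_combination_index(events):
    """Assign a sequential ID to each (region, device) pair and
    keep one user bitmap per combination ID."""
    combo_ids = {}    # (region, device) -> sequential ID
    combo_users = {}  # combination ID -> bitmap of users
    for user_id, region, device in events:
        combo_id = combo_ids.setdefault((region, device), len(combo_ids))
        combo_users[combo_id] = combo_users.get(combo_id, 0) | (1 << user_id)
    return combo_ids, combo_users

def users_matching(combo_ids, combo_users, region=None, device=None):
    """OR together the bitmaps of every combination that satisfies the filter."""
    result = 0
    for (r, d), combo_id in combo_ids.items():
        if (region is None or r == region) and (device is None or d == device):
            result |= combo_users[combo_id]
    return result

# Illustrative data only: user 0 used Beijing+Windows and Shanghai+Mac, never Beijing+Mac.
events = [
    (0, "Beijing", "Windows"),
    (0, "Shanghai", "Mac"),
    (1, "Beijing", "Mac"),
]
combo_ids, combo_users = build_combination_index(events)

# Intersecting per-dimension bitmaps wrongly includes user 0.
naive = (users_matching(combo_ids, combo_users, region="Beijing")
         & users_matching(combo_ids, combo_users, device="Mac"))

# Filtering on the combination itself returns only user 1.
exact = users_matching(combo_ids, combo_users, region="Beijing", device="Mac")
```

Queries on a single dimension still work: they select every combination ID containing that value and OR the corresponding bitmaps.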

6. Performance Comparison

Benchmarks comparing SparkSQL (run in local mode with 16 threads, local[16], and 4 GB RAM) against single‑threaded BitMap computation show that SparkSQL time grows linearly with data size, while BitMap time remains relatively stable.

7. Conclusion

BitMap provides both storage compression and fast set operations, making it suitable for large‑scale ID collections and enabling use cases such as cohort calculation, tagging, funnel analysis, retention, and user outreach without re‑computing groups.
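
As a rough illustration of why a stored user set can be reused across these use cases, the sketch below computes a retention figure from two activity bitmaps. The data and names are hypothetical, and a Python int stands in for a real BitMap implementation.

```python
def to_bitmap(user_ids):
    """Build a bitmap (Python int) with one bit set per user ID."""
    bitmap = 0
    for user_id in user_ids:
        bitmap |= 1 << user_id
    return bitmap

def popcount(bitmap):
    return bin(bitmap).count("1")

# Illustrative data only: users active on day 1 and on day 7.
active_day1 = to_bitmap([0, 1, 2, 3])
active_day7 = to_bitmap([1, 3, 5])

retained = active_day1 & active_day7   # active on both days
churned = active_day1 & ~active_day7   # active on day 1 but not on day 7
retention_rate = popcount(retained) / popcount(active_day1)
```

The same `retained` bitmap could then be intersected with a dimension bitmap or passed to an outreach job, without recomputing the group from raw events.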

Future work includes handling string‑based IDs, improving distributed BitMap performance, addressing high‑cardinality dimensions, and designing a SQL‑like language for this model.
