Applying BitMap Indexing with HBase for Precise Marketing in Big Data
This article details a big‑data precise‑marketing solution that leverages HBase storage and Roaring BitMap indexing to efficiently handle billions of user records, describing project background, technology selection, architecture, partitioning strategy, and coprocessor implementation for fast multidimensional queries.
The presentation, originally delivered by Mr. He Liangjun at the DataFunTalk technical salon, describes a precise‑marketing project that processes billions of user accounts and tens of millions of user‑profile tags. The goal is to enable millisecond‑level query responses for online marketing scenarios such as the "Easy Customer Acquisition" service.
Given the massive data volume and the need for many dimensional tags without traditional OLAP measures, the team evaluated open‑source multi‑dimensional analysis tools (Kylin, Druid) but ultimately chose a custom HBase + BitMap solution. HBase provides scalable column‑oriented storage, while BitMap (implemented with Roaring Bitmap) offers compact Boolean indexes for fast set operations.
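The core idea can be sketched in a few lines. The snippet below is purely illustrative: plain Python integers stand in for Roaring bitmaps (which add compression and chunked storage on top of the same model), and the tags and user IDs are hypothetical.

```python
# Conceptual sketch of bitmap tag indexes: bit i is set iff user i has the tag.
# Plain Python ints stand in for Roaring BitMaps; tags/IDs are made up.

def build_bitmap(user_ids):
    """Set bit i for every user ID i that carries the tag."""
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

def members(bm):
    """Recover the user IDs from a bitmap."""
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

# Hypothetical per-tag indexes.
tag_sports = build_bitmap([1, 3, 5, 7])
tag_vip    = build_bitmap([3, 4, 7, 9])

# "sports AND vip" / "sports OR vip" are single bitwise operations,
# independent of how many users carry each tag.
both   = members(tag_sports & tag_vip)   # [3, 7]
either = members(tag_sports | tag_vip)   # [1, 3, 4, 5, 7, 9]
```

This is exactly why Boolean tag queries become cheap: intersecting two audience segments is a bitwise AND over compressed words, not a join over billions of rows.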
Key architectural components include:
**Storage layer** – HBase clusters with active/master region servers, HFiles, and automatic region splitting.
**Computation layer** – HBase coprocessor Endpoints that execute bitmap operations in parallel across regions, each region computing over its local bitmap slice.
**Index layer** – Roaring BitMap indexes built per tag, stored as HFile values; indexes are partitioned to keep bitmap sizes manageable.
**Routing & API layer** – Netty‑based HTTP services receive front‑end queries, translate tag selections into bitmap intersection/union expressions, and dispatch them to the appropriate region servers.
To handle the billion‑scale user IDs, IDs are first transformed into continuous integers and then partitioned (e.g., 200 partitions of 5 million IDs each). Each partition aligns with an HBase region, allowing the coprocessor to operate on a fixed‑size bitmap per region.
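The partition arithmetic described above can be sketched as follows; the function names are illustrative, but the 5‑million bucket size and the zero‑padded key format follow the scheme in the text.

```python
# Sketch of the ID-partitioning scheme: user IDs are first remapped to a dense
# integer range, then split into fixed-size buckets of 5,000,000 so that each
# bucket's bitmap aligns with one HBase region. Function names are illustrative.

PARTITION_SIZE = 5_000_000

def locate(dense_id: int):
    """Return (partition, offset-within-partition) for a densely remapped ID."""
    return dense_id // PARTITION_SIZE, dense_id % PARTITION_SIZE

def split_key(partition: int) -> str:
    """Zero-padded region start key, matching the SPLITS of the create call."""
    return f"{partition * PARTITION_SIZE:010d}"

part, offset = locate(12_345_678)   # partition 2, offset 2,345,678
```

Because every partition holds at most 5 million offsets, each region's bitmap has a fixed, bounded size, which keeps coprocessor memory use predictable.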
Table creation with pre‑defined splits (and automatic splitting disabled, so regions stay aligned with the fixed partitions) is performed as follows:

```
create 'index',
  {METHOD => 'table_att', METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy'}},
  {NAME => 'd', COMPRESSION => 'SNAPPY'},
  SPLITS => ['0005000000','0010000000','0015000000','0020000000','0025000000','0030000000', ... ,'0995000000','1000000000']
```
During data preparation, user IDs are bucketed, tags are bucketed, and BitMap indexes are generated via MapReduce bulk‑load jobs, producing serialized BitMap objects stored in HFiles. The coprocessor then retrieves the relevant bitmap slices based on start‑key ranges, performs Boolean set operations (intersection, union) according to the query logic, and returns matching IDs to the client.
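The query-time flow can be sketched end to end. Everything below is a simplified illustration: plain ints stand in for the serialized Roaring bitmaps read from HFiles, and the function names and the tiny "and"/"or" expression format are hypothetical, not the project's actual coprocessor API.

```python
# Sketch of the per-region query flow: each region holds one bitmap slice per
# tag; the coprocessor combines the slices for its own partition, and the
# client merges per-region results into the final list of user IDs.

PARTITION_SIZE = 5_000_000  # one fixed-size bitmap slice per region

def eval_in_region(slices, op, tags):
    """Combine this region's bitmap slices for the requested tags."""
    result = slices[tags[0]]
    for t in tags[1:]:
        result = result & slices[t] if op == "and" else result | slices[t]
    return result

def collect(partition, bitmap):
    """Translate set bits back into global user IDs for one region."""
    base = partition * PARTITION_SIZE
    return [base + i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Two regions, two hypothetical tags; ints stand in for Roaring bitmaps.
region0 = {"sports": 0b1010, "vip": 0b0110}
region1 = {"sports": 0b0011, "vip": 0b0001}

ids = []
for part, slices in enumerate([region0, region1]):
    ids += collect(part, eval_in_region(slices, "and", ["sports", "vip"]))
# ids now holds the matching global user IDs across both regions.
```

The key property is that each region only ever touches its own fixed-size slice, so the expensive Boolean work is spread evenly across region servers and the client does nothing more than concatenate ID lists.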
The solution demonstrates how combining HBase’s distributed storage with Roaring BitMap’s compressed, fast Boolean computation can meet the scalability, performance, and reliability requirements of large‑scale precise marketing applications.