
Leveraging Bitmap for High‑Performance Multi‑Dimensional Analytics in Big Data

This article explains how bitmap data structures, combined with compression and in‑memory techniques, enable fast, flexible, and scalable multi‑dimensional analytics for large‑scale data platforms, addressing historical marketing inefficiencies and outlining future directions such as memory‑mapped files and distributed bitmap computation.


Background

a) Historical Confusion

The author grounds each technical point in a business story, using the Bitmap case to illustrate a long‑standing marketing problem: half of advertising spend is wasted, but marketers cannot tell which half, largely because they understand their target audience too poorly.

b) Dawn of the Era

Rapid advances in information technology—big‑data processing frameworks such as Hadoop, Spark, and Storm—have opened a path to solve this problem by enabling deep attribute, behavior, and even psychological analysis of massive user populations.

c) Business Drivers

Business requirements continuously push for better tools:

Faster analysis: current jobs take hours or a full day, which cannot support rapid business adjustments.

More flexible analysis: beyond simple statistics, multi‑dimensional cross‑computations are needed.

Lightweight systems: large Hadoop clusters are costly and unaffordable for many.

Existing Technology Analysis

a) Two Sides of the Coin

Initially the team relied heavily on Hadoop, but soon realized Hadoop is not a silver bullet; its high‑throughput design conflicts with low‑latency needs, leading to the adoption of Storm and Redis for real‑time processing. The typical architecture combines batch Hadoop with real‑time streams.

Although this seems to resolve the throughput‑latency trade‑off, the core business problem remains: analysts still need multi‑dimensional cross‑analysis over massive data, yet batch jobs take hours or days, while streaming can only handle current data and cannot rewind over history.

b) White Swan: Traditional Tech Defects

Viewing the problem as a “white swan” reveals two major technical obstacles: high I/O overhead (disk and network) caused by row‑oriented storage and the need to scan entire files, and inefficient computation, where CPU and memory are spent on data loading and filtering rather than on business logic. In this model, each query must:

Locate relevant files based on query.

Extract files from the file system, incurring disk I/O.

In distributed systems such as Hadoop, transfer data across the network, which is costly.

Read rows sequentially, many of which contain irrelevant information, leading to wasted resources.

The next‑generation computation model focuses on reducing overall I/O and directing compute resources to truly valuable operations.

c) From White Swan to Black Swan

Three tactics are employed to overcome the above drawbacks:

Pre‑processing to minimize inefficient file reads.

Compression using high‑ratio algorithms that do not affect calculation precision.

In‑memory computation: load the entire dataset into memory once, allowing the CPU to focus on business calculations.

The “black swan” that unifies these advantages is the Bitmap data structure.

Bitmap Secrets

a) How Bitmap Enables Multi‑Dimensional Cross Computation

A bit is the smallest unit of data (0 or 1). A bitmap stores information as an array of bits, which is extremely space‑efficient. For example, attendance for eight employees can be represented by a single byte where each bit indicates presence.

Because bits support fast logical operations (AND, OR, XOR, NOT), multiple bitmaps can be combined to perform multi‑dimensional analysis. For instance, to find employees absent on two consecutive days, one can OR the two daily bitmaps and check for zeros.

Further AND operations can filter by additional attributes, such as gender.
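The attendance example above can be sketched with plain Python integers as bitmaps. All values here are illustrative; each bit position stands for one of eight employees:

```python
# Hypothetical attendance bitmaps: 1 = present, 0 = absent.
monday  = 0b10110111   # employees present on Monday
tuesday = 0b11010110   # employees present on Tuesday
male    = 0b01010101   # attribute bitmap: 1 = male employee

# Present on both days: AND the two daily bitmaps.
both_days = monday & tuesday            # 0b10010110

# Absent on both days: a zero bit in (monday OR tuesday).
either_day = monday | tuesday
absent_both = ~either_day & 0xFF        # mask NOT to 8 bits -> 0b00001000

# Filter further by an attribute: male employees present on both days.
male_both = both_days & male            # 0b00010100

print(bin(both_days), bin(absent_both), bin(male_both))
```

The mask on the NOT result matters: Python integers are unbounded, so `~` must be confined to the eight bits that actually represent employees.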

b) How Bitmap Achieves High‑Speed Computation

Bitmap alone still incurs storage and compute waste: large bitmaps must be persisted (e.g., in a database or file) and loaded into memory, consuming I/O and memory. General‑purpose compression such as gzip or LZO reduces storage size but adds CPU overhead for decompression. The real solution is bitmap‑specific compression, typically based on Run‑Length Encoding (RLE). RLE encodes consecutive identical bits as a single run, dramatically shrinking sparse bitmaps. Compressed bitmaps can be stored aligned to word boundaries (e.g., 64‑bit) so that logical operations can be performed directly on the compressed representation, keeping memory usage low while maximizing CPU efficiency. Common schemes include BBC, WAH, and EWAH (used in Apache Hive).
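The run‑length principle behind these schemes can be shown with a minimal sketch. This is not BBC, WAH, or EWAH itself (those work on aligned machine words), only the underlying idea that consecutive identical bits collapse into a single run:

```python
# Minimal RLE sketch for a sparse bitmap (illustrative, not a real WAH/EWAH codec).
from itertools import groupby

def rle_encode(bits):
    """Encode a bit sequence as (bit_value, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_decode(runs):
    """Expand (bit_value, run_length) pairs back into a bit list."""
    return [bit for bit, length in runs for _ in range(length)]

# A sparse bitmap: a million bits with a single 1 collapses to just 3 runs.
bitmap = [0] * 500_000 + [1] + [0] * 499_999
runs = rle_encode(bitmap)
print(runs)   # [(0, 500000), (1, 1), (0, 499999)]
assert rle_decode(runs) == bitmap
```

Word‑aligned schemes go one step further: by keeping runs aligned to 32‑ or 64‑bit words, AND/OR/NOT can be evaluated run‑by‑run without ever decompressing the bitmap.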

c) Bitmap Capability in Big Data Computation

Using a TalkingData analytics example, bitmap can compute next‑day user retention efficiently. Performance tests show the bitmap‑based implementation outperforms traditional approaches by orders of magnitude.
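As a hedged sketch of the idea (not TalkingData's actual implementation), next‑day retention reduces to a single AND between two daily active‑user bitmaps, followed by population counts; all bitmap values below are made up:

```python
# Illustrative daily active-user bitmaps: bit i represents user i.
active_day1 = 0b1011_0110   # users active on day 1
active_day2 = 0b1101_0011   # users active on day 2

# Retained users are active on both days: one AND operation.
retained = active_day1 & active_day2    # 0b10010010

# Retention rate = |day1 AND day2| / |day1|, via population counts.
rate = bin(retained).count("1") / bin(active_day1).count("1")
print(f"retained={retained:08b}, rate={rate:.2f}")
```

The expensive part of a traditional retention query (joining two days of raw logs per user) becomes a single bitwise operation over pre‑built bitmaps.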

d) Processing Flow After Introducing Bitmap

Data collection system gathers device uploads and forwards them to both real‑time and batch processors.

Real‑time processors (custom counters or Storm‑based) compute simple metrics and periodically update Redis/HBase; front‑end services query these stores for reporting.

Batch processors deduplicate data per user and generate/modify daily active‑user bitmaps, as well as attribute bitmaps (device model, region, OS, etc.).

Reporting queries retrieve the required bitmaps and combine them with AND/OR/NOT operations in memory to deliver multi‑dimensional analytics.
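The final reporting step above can be sketched as follows. The bitmap store, attribute names, and values are all hypothetical; the point is that a multi‑dimensional query becomes a chain of in‑memory AND/OR/NOT operations:

```python
# Illustrative reporting step: fetch stored bitmaps and combine them in memory.
NUM_USERS = 16
MASK = (1 << NUM_USERS) - 1   # NOT must be masked to the user universe

# Hypothetical bitmaps as loaded from the batch-built store; bit i = user i.
bitmaps = {
    "active_today": 0b1010_1101_0110_1011,
    "os_ios":       0b1100_0111_0010_1111,
    "region_cn":    0b0110_1010_1100_0101,
}

# Active iOS users outside region CN: AND, then AND NOT.
result = (bitmaps["active_today"]
          & bitmaps["os_ios"]
          & (~bitmaps["region_cn"] & MASK))

# Materialize matching user IDs from the set bits.
users = [i for i in range(NUM_USERS) if result >> i & 1]
print(users)
```

Adding another dimension (device model, app version, and so on) just ANDs in one more bitmap, which is why this model stays flexible without re‑scanning raw data.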

Future of the Black Swan

TalkingData plans to extend bitmap capabilities with faster, more convenient, and flexible data services, while tackling emerging challenges.

a) Memory‑Mapped Files

Even with optimized compression, loading bitmap files into memory remains a bottleneck. The team previously stored bitmaps in MySQL, but as data grew, query latency was dominated by MySQL (≈1 ms per query). Memory‑mapped files map a disk file directly into the process address space, eliminating explicit I/O and allowing the operating system to manage caching, thus greatly improving throughput for massive bitmap files.
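A minimal sketch of the memory‑mapped approach, using Python's `mmap` module; the file name and contents are illustrative. The OS pages mapped bytes in on demand, so a huge bitmap file need not be read into memory up front:

```python
import mmap
import os

path = "daily_active.bitmap"   # hypothetical bitmap file

# Build a demo bitmap file: 1 MiB of zero bytes, with users 0 and 2 set.
data = bytearray(1 << 20)
data[0] = 0b0000_0101
with open(path, "wb") as f:
    f.write(data)

# Map the file into the process address space instead of reading it.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        def is_active(i):
            """Test bit i directly against the mapped file."""
            return mm[i // 8] >> (i % 8) & 1
        results = (is_active(0), is_active(1), is_active(2))

print(results)   # (1, 0, 1)
os.remove(path)
```

Because the kernel caches mapped pages, repeated queries against hot regions of the bitmap file avoid both explicit read calls and double buffering in user space.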

b) Distributed Bitmap Computation

When analyses require tens or hundreds of thousands of bitmaps (e.g., multi‑year metrics), a single node cannot hold all the data in memory. The next‑generation engine will partition the workload across many machines using a fork‑join or distributed model, then merge partial results. Because bitmap operations correspond to set algebra (AND/OR/NOT), expressions can be rewritten for parallel evaluation. For example, to find users who visited the US and visited China, or who use iOS, one can compute (A∩B)∪C directly, or distribute it via the distributive law (A∩B)∪C = (A∪C)∩(B∪C): two nodes each compute one union term, and a coordinator intersects the results. Challenges include data skew, task‑splitting optimization, and load balancing under high concurrency.

Conclusion

Reflecting on a 30‑year‑old paper about astronomers using cardinality (bitmap) calculations, the author notes that modern product analytics—tracking user frequency and temporal patterns—are a commercial re‑application of classic computational science.

Disclaimer: The content originates from public internet sources; the author remains neutral and provides it for reference and discussion only. Copyright belongs to the original authors.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Bitmap · Distributed Computing · Compression · Multi-dimensional Analytics
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
