Big Data 18 min read

Unlocking Massive Data Efficiency: How Bitmap and RoaringBitmap Transform Big Data Storage

This article explains the principles, Java implementation, and performance benefits of Bitmap and RoaringBitmap, demonstrating how they dramatically reduce storage costs, enable fast deduplication and set operations, and optimize large‑scale data warehouse queries in real‑world scenarios.

Huolala Tech
Huolala Tech
Huolala Tech
Unlocking Massive Data Efficiency: How Bitmap and RoaringBitmap Transform Big Data Storage

Introduction

In the era of big data, improving decision‑making, insight discovery, and process optimization requires efficient, reliable, and rich data generation using limited resources. Huolala’s data warehouse team explores cost‑saving and performance‑boosting techniques, such as Bitmap, to enhance downstream product efficiency.

About Bitmap

Bitmap (bit map) is a classic data structure that stores bits to represent the presence of integers, effectively solving large‑scale deduplication problems. Each bit corresponds to a number: 1 indicates presence, 0 indicates absence.

What

For example, the set {1,3,5,7} would normally require four 32‑bit integers (16 bytes). Using Bitmap, only 8 bits (1 byte) are needed, with each bit position representing a number.

Simple Bitmap Implementation in Java

import java.util.Arrays;

public class BitMap {
    // byte array to store bits
    private byte[] bits;
    // length of the bitmap
    private int bitSize;
    // constructor
    public BitMap(int bitSize) {
        this.bitSize = (bitSize >> 3) + 1;
        bits = new byte[(bitSize >> 3) + 1];
    }
    // add a number
    public void add(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        bits[arrayIndex] |= 1 << position;
    }
    // check existence
    public boolean contain(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        return (bits[arrayIndex] & (1 << position)) != 0;
    }
    // clear a number
    public void clear(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        bits[arrayIndex] &= ~(1 << position);
    }
    // print underlying bits
    public static void printBit(BitMap bitMap) {
        int index = bitMap.bitSize & 0x07;
        for (int j = 0; j < index; j++) {
            System.out.print("byte[" + j + "] = ");
            byte num = bitMap.bits[j];
            for (int i = 7; i >= 0; i--) {
                System.out.print((num & (1 << i)) == 0 ? "0" : "1");
            }
            System.out.println();
        }
    }
    // demo
    public static void main(String[] args) {
        BitMap bitmap = new BitMap(3);
        bitmap.add(3);
        System.out.println("Inserted 3");
        System.out.println("3 exists: " + bitmap.contain(3));
        printBit(bitmap);
        bitmap.clear(3);
        System.out.println("3 exists after clear: " + bitmap.contain(3));
        printBit(bitmap);
    }
}

Running the demo shows how bits change after insertion and clearing.

Why Use Bitmap

Key advantages of Bitmap include:

Massive storage cost reduction for large, unordered, non‑duplicate integer sets.

Natural deduplication because each position can hold only one value.

O(1) lookup time for existence checks.

Efficient set operations (union, intersection, difference) using bitwise logic.

However, Bitmap has limitations: it only stores non‑negative integers, is inefficient for sparse data, and cannot handle duplicate values.

Optimizing Bitmap

Map non‑numeric data (e.g., strings) to integers via hashing, possibly using multiple hash functions to reduce collisions.

Use compressed bitmap variants such as WAH, EWAH, or the widely adopted RoaringBitmap for sparse data.

RoaringBitmap Core Principle

RoaringBitmap splits a 32‑bit integer into high and low 16‑bit parts. The high 16 bits become a key identifying a container; the low 16 bits are stored inside the container. Three container types exist:

ArrayContainer – stores up to 4096 16‑bit values in a short array (sparse).

BitmapContainer – stores a fixed 65536‑bit bitmap (dense).

RunContainer – stores runs of consecutive values as start‑end pairs.

When a container exceeds 4096 elements, it automatically converts from ArrayContainer to BitmapContainer, achieving optimal space‑time trade‑offs.

How to Apply Bitmap

Bitmap can be a double‑edged sword: used correctly, it eliminates bottlenecks; used incorrectly, it wastes storage and accuracy.

Bitmap for User Segmentation

By converting each attribute (city_id, is_user_start, is_evl, is_order) into separate bitmaps, complex segment queries become simple bitwise AND operations. For example, the segment “city 1001 AND order = 1” translates to Bitmap[00000110] & Bitmap[00001010] = 1, which is far faster than scanning a wide table.

Bitmap in A/B Experiment Platform

Huolala’s A/B testing platform faced massive driver‑ID sets causing storage and compute pressure on the data warehouse and OLAP engine. By converting driver ID arrays into RoaringBitmap and feeding them to the OLAP layer, the end‑to‑end processing time was reduced by over two hours, and resource consumption dropped significantly.

Conclusion

This article covered a simple Bitmap implementation, dissected Java’s BitSet source, highlighted Bitmap’s strengths and weaknesses, introduced RoaringBitmap compression techniques, and presented practical use cases such as user segmentation and A/B testing, aiming to equip readers with actionable insights for large‑scale data processing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

optimizationBig DataBitmapRoaringBitmapData Structures
Huolala Tech
Written by

Huolala Tech

Technology reshapes logistics

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.