How RoaringBitmap Transforms Massive User ID Storage in CDPs
This article explains how a CDP tackles billions‑scale user ID tags and groups by replacing naïve text‑file storage with bitmap techniques, detailing Bitmap basics, encoding strategies, Java BitSet limitations, and the adoption of RoaringBitmap for efficient compression and fast set operations.
1. Background
In the CDP system, there are thousands of tags and over 20,000 groups, each consisting of massive user ID collections, many exceeding billions of IDs. As the user ID pool grows, storing and computing combinations of these tags and groups becomes a major challenge.
2. Problem Description
Storing billions of user IDs for tags and groups demands high‑capacity storage structures. For example, a group with 10 million IDs would need about 150 MB in a text file, and 40 such groups would require roughly 60 GB, which is unacceptable when the number of groups reaches tens of thousands.
Beyond storage, generating finer‑grained ID subsets for tag‑group calculations also poses difficulties.
3. Solution
The CDP adopts a bitmap approach, which not only reduces storage space but also enables efficient set operations (union, intersection, difference) for tag and group calculations.
1) Bitmap Introduction
A bitmap uses individual bits to uniquely represent values, allowing each data element to occupy a single bit and dramatically saving space. For an array [2,4,6,8], using a byte (8 bits) suffices, whereas storing the same values with Java byte or Integer would consume 32 bits or 128 bits respectively.
2) User ID Pool Encoding
Each user ID is mapped to a unique offset (a numeric index). The current offset space exceeds billions, so a table linking IDs to offsets is maintained, allowing new IDs to receive sequential offsets.
To create a tag containing only even IDs, set the bits corresponding to offsets 2, 4, 6, 8 to 1.
3) Encountered Issues
Java's BitSet cannot handle offsets beyond 2³²‑1, while many IDs exceed this range. Moreover, sparse groups (e.g., offsets 1 and 10,000,000) still require a bitmap of 10 million bits, wasting space.
Data sparsity leads to severe space waste.
public void set(int bitIndex) {
if (bitIndex < 0)
throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
int wordIndex = wordIndex(bitIndex);
expandTo(wordIndex);
words[wordIndex] |= (1L << bitIndex); // Restores invariants
checkInvariants();
}
private void expandTo(int wordIndex) {
int wordsRequired = wordIndex + 1;
if (wordsInUse < wordsRequired) {
ensureCapacity(wordsRequired);
wordsInUse = wordsRequired;
}
}
private void ensureCapacity(int wordsRequired) {
if (words.length < wordsRequired) {
int request = Math.max(2 * words.length, wordsRequired);
words = Arrays.copyOf(words, request);
sizeIsSticky = false;
}
}To avoid waste, a compressed bitmap implementation is needed.
4) RoaringBitmap Compression
The solution uses RoaringBitmap, an efficient compressed bitmap library introduced in 2016 by S. Chambi, D. Lemire, O. Kaser, etc.
It splits 32‑bit integers into high 16 bits (container ID) and low 16 bits (data container). The high bits act as buckets, each holding many low‑bit containers.
protected static char highbits(int x) {
return (char) (x >>> 16);
}
protected static char lowbits(int x) {
return (char) x;
}For example, value 65538 is stored in bucket 1 at position 2.
Current RoaringBitmap version 0.8.13 provides three container types: ArrayContainer, BitmapContainer, and RunContainer.
Because the ID pool exceeds 32‑bit limits, the 64‑bit Roaring64NavigableMap is used, following the same high/low split concept.
Dependency:
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>0.8.13</version>
</dependency> public void add(long... dat) {
for (long oneLong : dat) {
addLong(oneLong);
}
}
public void addLong(long x) {
int high = high(x);
int low = low(x);
// ...
}
public static int high(long id) {
return (int) (id >>> 32);
}
public static int low(long id) {
return (int) id;
}Bitmap operation methods are illustrated in the following diagram:
5. Current Status and Outlook
All CDP tags and groups now use RoaringBitmap for storage, enabling fast set operations to generate finer‑grained audiences. The next step is to transform detailed warehouse data into raw tags or groups, which will be covered in a future article.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
