Using RoaringBitmap for Efficient Storage and Computation of Massive User ID Sets in CDP Systems
This article explains how a CDP system tackles the storage and set‑operation challenges of billions of user‑ID tags and groups by adopting bitmap techniques, especially RoaringBitmap, to dramatically reduce space usage and enable fast union, intersection, and difference calculations.
1. Background
In the CDP system there are thousands of tags and tens of thousands of groups, each consisting of user‑ID collections; many tags contain billions of IDs, and groups can reach tens of billions of IDs, creating a storage and computation challenge.
2. Problem Description
Storing these massive ID collections requires a storage structure capable of handling hundreds of gigabytes; for example, a group with 10 million IDs would need about 150 MB, and 40 such groups would exceed 60 GB, which is unacceptable when combined with tag data.
Beyond storage, performing set operations (union, intersection, difference) on tags and groups to generate finer‑grained ID packs is also challenging.
3. Solution
The CDP adopts a bitmap‑based approach, which reduces storage space and enables efficient bitwise set operations.
3.1 Bitmap Overview
A bitmap uses a single bit to represent the presence of a value, allowing each datum to occupy only one bit and dramatically saving space. For example, storing the array [2,4,6,8] with a Java byte array would require 32 bits, whereas a bitmap can represent it with a single byte.
3.2 User‑ID Pool Encoding
Each user ID is mapped to a unique offset (a 64‑bit integer), stored in an ID‑pool table; new IDs receive sequential offsets.
To create a tag or group containing only even IDs, set the bits corresponding to the offsets of those IDs to 1.
3.3 Issues with Java BitSet
Java’s BitSet cannot handle offsets beyond 2³²‑1, and sparse data leads to large wasted space because BitSet expands to the highest offset.
public void set(int bitIndex) {
if (bitIndex < 0)
throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);
int wordIndex = wordIndex(bitIndex);
expandTo(wordIndex);
words[wordIndex] |= (1L << bitIndex); // Restores invariants
checkInvariants();
}
private void expandTo(int wordIndex) {
int wordsRequired = wordIndex + 1;
if (wordsInUse < wordsRequired) {
ensureCapacity(wordsRequired);
wordsInUse = wordsRequired;
}
}
private void ensureCapacity(int wordsRequired) {
if (words.length < wordsRequired) {
int request = Math.max(2 * words.length, wordsRequired);
words = Arrays.copyOf(words, request);
sizeIsSticky = false;
}
}3.4 RoaringBitmap Compression
RoaringBitmap provides a compressed bitmap implementation that splits 32‑bit integers into high and low 16‑bit parts, storing low bits in containers and high bits as container identifiers.
protected static char highbits(int x) {
return (char) (x >>> 16);
}
protected static char lowbits(int x) {
return (char) x;
}For 64‑bit IDs, RoaringBitmap offers Roaring64NavigableMap, which separates high 32 bits and low 32 bits.
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>0.8.13</version>
</dependency> public void add(long... dat) {
for (long oneLong : dat) {
addLong(oneLong);
}
}
public void addLong(long x) {
int high = high(x);
int low = low(x);
// ………
}
public static int high(long id) {
return (int) (id >>> 32);
}
public static int low(long id) {
return (int) id;
}4. Current Status and Outlook
All CDP tags and groups now use RoaringBitmap for storage; set operations are performed via bitmap methods to generate finer‑grained audiences. Future work will focus on transforming raw warehouse data into tags and groups.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.