Designing a Scalable Real‑Time Data Warehouse with Redis: Challenges and Solutions
The article analyzes the massive storage and performance challenges of a real‑time DMP cache built on Redis, outlines data characteristics and technical obstacles, and proposes eviction policies, bucket‑based hashing, and fragmentation‑reduction techniques with Java code examples to achieve billion‑scale in‑memory key‑value storage.
The author discusses a real‑time data warehouse scenario for a DMP that must store billions of mapping relationships between third‑party IDs (cookies, IMEI, IDFA) and a unified super‑ID, along with demographic tags, while providing millisecond‑level query latency.
While offline storage on HDFS can handle the volume, the real challenge lies in keeping all the data in memory: the key‑value set easily exceeds 5 billion entries, requiring over 1 TB of RAM, and traditional replication would inflate memory consumption further.
Data characteristics include short keys/values, highly variable cookie lengths, and a daily influx of billions of new mappings, making it impossible to rely on warm‑data pre‑loading.
The technical challenges identified are memory fragmentation due to variable‑length keys, high pointer‑induced memory bloat (up to 7×), unpredictable hot‑data patterns, strict latency requirements (<100 ms over public networks), long data retention (≥35 days), and the high cost of storing keys at the tens‑of‑billions scale.
To address these, the article proposes three main solutions:
1. Eviction Strategy – aggregate logs in HBase, set a 35‑day TTL, and use Redis key expiration with renewal on access to automatically discard cold IDs while retaining hot ones.
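The renewal‑on‑access idea can be sketched as below. `KvStore` is a hypothetical stand‑in for a Redis client so the logic is self‑contained; with a real client such as Jedis, the two calls would be `get()` followed by `expire()`.

```java
// Minimal sketch of TTL renewal on access (KvStore is a hypothetical
// interface standing in for a Redis client; not the article's exact code).
public class TtlRenewal {
    static final int TTL_SECONDS = 35 * 24 * 3600; // 35-day retention window

    interface KvStore {
        String get(String key);
        void expire(String key, int seconds);
    }

    // Every successful lookup pushes the key's expiration 35 days into the
    // future, so hot IDs stay resident while cold IDs age out automatically.
    static String getAndRenew(KvStore store, String key) {
        String value = store.get(key);
        if (value != null) {
            store.expire(key, TTL_SECONDS);
        }
        return value;
    }
}
```

The same TTL is re‑applied on every hit rather than computed from remaining time, which keeps the read path to two O(1) Redis commands.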
2. Reducing Memory Expansion – replace direct keys with fixed‑length bucket IDs generated by hashing the original key (e.g., with MD5) and store the actual key‑value pairs inside a Redis hash keyed by that bucket ID. This can collapse the number of top‑level Redis keys by over 90 % when ~10 keys share a bucket.
The Java implementation for generating a bucket ID is shown below:
public static byte[] getBucketId(byte[] key, int bit) throws NoSuchAlgorithmException {
    MessageDigest mdInst = MessageDigest.getInstance("MD5");
    mdInst.update(key);
    byte[] md = mdInst.digest();
    // 7 usable bits per byte, so each bucket-ID byte stays single-character ASCII
    byte[] r = new byte[(bit - 1) / 7 + 1];
    // mask off the surplus bits in the last byte
    int a = (int) Math.pow(2, bit % 7) - 2;
    md[r.length - 1] = (byte) (md[r.length - 1] & a);
    System.arraycopy(md, 0, r, 0, r.length);
    // clear the sign bit of every byte to keep it in ASCII range
    for (int i = 0; i < r.length; i++) {
        if (r[i] < 0) r[i] &= 127;
    }
    return r;
}
Choosing a 30‑bit bucket space yields about 2^30 (roughly one billion) buckets; with around ten key‑value pairs per bucket, this meets the target of storing keys at the tens‑of‑billions scale with a manageable number of top‑level Redis keys.
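One way the bucket ID could route reads and writes into Redis hashes is sketched below. `BucketRouting` and `bucketFor` are illustrative names, and the Jedis calls in the trailing comments are an assumed wiring, not the article's exact code; `getBucketId` mirrors the article's derivation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of routing reads and writes through fixed-length buckets.
// Class and method names are illustrative, not from the article.
public class BucketRouting {
    // Same derivation as the article's getBucketId: take the leading
    // ceil(bit/7) MD5 bytes and force each byte into ASCII range.
    public static byte[] getBucketId(byte[] key, int bit) throws NoSuchAlgorithmException {
        MessageDigest mdInst = MessageDigest.getInstance("MD5");
        mdInst.update(key);
        byte[] md = mdInst.digest();
        byte[] r = new byte[(bit - 1) / 7 + 1];
        int a = (int) Math.pow(2, bit % 7) - 2;
        md[r.length - 1] = (byte) (md[r.length - 1] & a);
        System.arraycopy(md, 0, r, 0, r.length);
        for (int i = 0; i < r.length; i++) {
            if (r[i] < 0) r[i] &= 127;
        }
        return r;
    }

    // Render the bucket ID as a short ASCII string usable as a Redis key.
    public static String bucketFor(String originalKey, int bit) throws NoSuchAlgorithmException {
        byte[] id = getBucketId(originalKey.getBytes(StandardCharsets.UTF_8), bit);
        return new String(id, StandardCharsets.US_ASCII);
    }
}
// With a Jedis client (assumed usage):
//   write: jedis.hset(bucketFor(deviceId, 30), deviceId, superIdWithTags);
//   read:  jedis.hget(bucketFor(deviceId, 30), deviceId);
```

Because the bucket ID is a deterministic function of the original key, lookups need no extra index: the client recomputes the bucket and issues a single HGET.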
3. Reducing Fragmentation – store keys with equal length (fixed‑size bucket IDs) and truncate device IDs to their last six characters to improve memory alignment; use lightweight value encoding (three bytes for age, gender, geo). Additionally, occasional master‑slave failover can compact memory, and specialized allocators like tcmalloc or jemalloc can further reduce fragmentation.
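The three‑byte value encoding can be sketched as follows. The one‑byte‑per‑field layout is an assumption; the article only states that age, gender, and geo together fit in three bytes.

```java
// Compact 3-byte encoding for demographic tags. The one-byte-per-field
// layout (age, gender code, region code) is an assumed layout; the article
// only says the three tags fit in three bytes.
public class TagCodec {
    public static byte[] encode(int age, int gender, int region) {
        return new byte[] { (byte) age, (byte) gender, (byte) region };
    }

    public static int[] decode(byte[] value) {
        // mask with 0xFF so codes above 127 survive the signed-byte round trip
        return new int[] { value[0] & 0xFF, value[1] & 0xFF, value[2] & 0xFF };
    }
}
```

Fixed three‑byte values, like fixed‑length bucket IDs, give the allocator uniformly sized objects, which is what keeps fragmentation low.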
Overall, the proposed architecture combines TTL‑based eviction, bucket‑hashing, and memory‑alignment techniques to enable an in‑memory, low‑latency DMP cache capable of handling billions of records efficiently.